deepseek-ai/deepseek-vl2 🖼️🔢📝 → 📝
About
DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL.

Example Output
Prompt:
"Describe each stage of this <image> in detail"
Output
The figure illustrates a three-stage process for training and fine-tuning a Vision-Language (VL) model using the DeepSeek LLM framework. Here's a detailed description of each stage:
Stage 1: Training VL Adaptor
Components Involved:
- DeepSeek LLM: This is the foundational language model used throughout the stages.
- Vision-Language Adaptor: This component is responsible for adapting the language model to handle vision-language tasks.
- Hybrid Vision Encoder: This encoder processes visual data and combines it with textual information.
- SAM-B and SigLIP-L: These are specific models or components within the hybrid vision encoder that contribute to processing image-text pairs.
Process:
- The DeepSeek LLM is used as the base model.
- The Vision-Language Adaptor is trained on image-text pairs, which involves aligning visual and textual information.
- The Hybrid Vision Encoder processes these image-text pairs, integrating both visual and textual features.
Stage 2: Joint VL Pre-training
Components Involved:
- DeepSeek LLM: Continues to be the core language model.
- Vision-Language Adaptor: Further refined through joint pre-training.
- Hybrid Vision Encoder: Enhanced to better handle interleaved vision and language sequences.
- SAM-B and SigLIP-L: Continue to play roles in encoding visual and textual data.
Process:
- The model undergoes joint pre-training using interleaved vision and language sequences.
- This step helps the model learn to effectively combine and process both types of data simultaneously.
- The Vision-Language Adaptor and Hybrid Vision Encoder are further optimized during this phase.
Stage 3: Supervised Finetuning
Components Involved:
- DeepSeek LLM: Now fully integrated into the VL system.
- Vision-Language Adaptor: Fully adapted and ready for specific tasks.
- Hybrid Vision Encoder: Finalized and capable of handling complex VL tasks.
- SAM-B and SigLIP-L: Continue their roles in encoding and processing data.
Process:
- The model is fine-tuned using VL chat data and pure language chat data.
- This supervised finetuning phase refines the model's performance on specific VL tasks, such as conversational understanding and generation.
- The Vision-Language Adaptor and Hybrid Vision Encoder are fine-tuned to ensure they work seamlessly together for the desired outcomes.
Overall, this three-stage process leverages the strengths of the DeepSeek LLM and specialized adaptors to create a robust Vision-Language model capable of handling various tasks involving both visual and textual data.
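For readers who want the pipeline above in a machine-readable form, the staged recipe can be restated as plain data. The sketch below is editorial and purely illustrative (the `Stage` dataclass and `PIPELINE` constant are invented names, not DeepSeek code); it encodes nothing beyond what the description says about each stage's components, training data, and purpose.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    components: list[str]  # building blocks involved at this stage
    data: list[str]        # training data named in the description
    goal: str              # what the stage is meant to achieve

PIPELINE = [
    Stage(
        name="Stage 1: Training VL Adaptor",
        components=["DeepSeek LLM", "Vision-Language Adaptor",
                    "Hybrid Vision Encoder (SAM-B + SigLIP-L)"],
        data=["image-text pairs"],
        goal="align visual and textual information through the adaptor",
    ),
    Stage(
        name="Stage 2: Joint VL Pre-training",
        components=["DeepSeek LLM", "Vision-Language Adaptor",
                    "Hybrid Vision Encoder (SAM-B + SigLIP-L)"],
        data=["interleaved vision-language sequences"],
        goal="learn to combine and process both modalities jointly",
    ),
    Stage(
        name="Stage 3: Supervised Finetuning",
        components=["DeepSeek LLM", "Vision-Language Adaptor",
                    "Hybrid Vision Encoder (SAM-B + SigLIP-L)"],
        data=["VL chat data", "pure language chat data"],
        goal="refine conversational understanding and generation",
    ),
]

if __name__ == "__main__":
    for stage in PIPELINE:
        print(f"{stage.name}: {', '.join(stage.data)} -> {stage.goal}")
```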
Performance Metrics
- Prediction Time: 40.04s
- Total Time: 40.05s
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/MTtsBStHRqLDgNZMkt0J7PptoJ3lseSUNcGaDkG230ttNJlT/workflow.png", "top_p": 0.9, "prompt": "Describe each stage of this <image> in detail", "temperature": 0.1, "max_length_tokens": 2048, "repetition_penalty": 1.1 }
Input Parameters
- image (required): Input image file
- top_p: Top-p (nucleus) sampling parameter (see the decoding sketch after this list)
- prompt (required): Text prompt to guide the model; reference the uploaded image with the <image> placeholder
- temperature: Temperature for sampling
- max_length_tokens: Maximum number of tokens to generate
- repetition_penalty: Repetition penalty applied during generation
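The three sampling knobs interact during decoding. Below is a minimal, illustrative sketch (not DeepSeek-VL2's or Replicate's actual decoding code; `sample_next_token` is a hypothetical helper) of how repetition penalty, temperature, and top-p are conventionally applied to next-token logits:

```python
import numpy as np

def sample_next_token(logits, generated_ids, temperature=0.1, top_p=0.9,
                      repetition_penalty=1.1, rng=np.random.default_rng()):
    """Sample one token id from raw next-token logits (1-D array)."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty (CTRL-style): make already-generated tokens less likely.
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    logits /= max(temperature, 1e-8)

    # Softmax (shifted for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus) filtering: keep the smallest set of most-probable tokens
    # whose cumulative probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()

    return int(rng.choice(len(probs), p=filtered))

# Tiny usage example over a toy 5-token vocabulary.
example_logits = np.array([2.0, 1.5, 0.5, -1.0, -3.0])
print(sample_next_token(example_logits, generated_ids=[0]))
```

With temperature=0.1 and top_p=0.9, as in the request above, the distribution is sharpened to the point that decoding is close to greedy, which suits factual image-description prompts.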
Output Schema
- Output: the generated text returned by the model
Version Details
- Version ID: e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
- Version Created: February 11, 2025
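To reproduce results against this exact build, the version ID above can be pinned programmatically. A minimal sketch, assuming the `replicate` Python client's `models.get` / `versions.get` API and a configured `REPLICATE_API_TOKEN`:

```python
import replicate

# Look up the model and pin the specific version listed above.
model = replicate.models.get("deepseek-ai/deepseek-vl2")
version = model.versions.get(
    "e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200"
)
print(version.id, version.created_at)
```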