chenxwh/deepseek-vl2
About
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Example Output
Output
{"img_out":null,"text_out":"The figure illustrates the three-stage training process for a Vision-Language (VL) model using the DeepSeek LLM. Here's a detailed description of each stage:
Stage 1: Training VL Adaptor
Components Involved:
- DeepSeek LLM: This is the large language model that serves as the backbone for the entire process.
- Vision-Language Adaptor: This component is responsible for adapting the vision and language models to work together effectively.
- Hybrid Vision Encoder: This encoder processes visual data and converts it into a format suitable for the model. It uses two specific models, SAM-B and SigLIP-L, which are pre-trained on image-text pairs.
- Image-Text Pairs: These pairs are used to train the hybrid vision encoder. The encoder learns to understand and generate text based on visual inputs.
Process:
- The hybrid vision encoder is trained using image-text pairs. This training helps the encoder learn how to map images to corresponding text descriptions.
Stage 2: Joint VL Pre-training
Components Involved:
- DeepSeek LLM: Continues to be the central model.
- Vision-Language Adaptor: Continues to adapt the vision and language components.
- Hybrid Vision Encoder: Continues to process visual data.
- Interleaved VL + Pure Language Sequences: In this stage, both vision-language sequences and pure language sequences are interleaved. This means that the model is exposed to both types of data during training, allowing it to learn from a more diverse set of examples.
Process:
- The model is pre-trained using a combination of vision-language sequences and pure language sequences. This joint training helps the model develop a better understanding of both visual and textual information.
Stage 3: Supervised Finetuning
Components Involved:
- DeepSeek LLM: Continues to be the central model.
- Vision-Language Adaptor: Continues to refine the adaptation between vision and language.
- Hybrid Vision Encoder: Continues to process visual data.
- VL Chat Data + Pure Language Chat Data: In this final stage, the model is fine-tuned using both vision-language chat data and pure language chat data. This step focuses on improving the model's performance specifically in generating responses to chat queries.
Process:
- The model undergoes supervised finetuning with a focus on chat tasks. The finetuning process involves adjusting the model parameters to optimize its performance in generating coherent and contextually relevant responses to user queries.
Summary
The figure outlines a comprehensive approach to training a Vision-Language model using the DeepSeek LLM. Starting with training the VL adaptor and hybrid vision encoder on image-text pairs, followed by joint pre-training with interleaved sequences, and finally, supervised finetuning with chat data, the model is progressively refined to enhance its ability to handle complex vision-language tasks.<｜end▁of▁sentence｜>"}
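The prediction is returned as a JSON object with an img_out field (null here) and a text_out field that can end with the model's <｜end▁of▁sentence｜> marker. Below is a minimal sketch of reading such a result in Python; the field names and the special token come from the example above, while the output.json path is a placeholder for wherever the response was saved.

```python
import json

# "img_out" / "text_out" and the end-of-sentence marker are taken from the
# example output above; "output.json" is a hypothetical local copy of it.
with open("output.json") as f:
    result = json.load(f)

text = result["text_out"]
# Drop the trailing special token if the model emitted it (Python 3.9+).
text = text.removesuffix("<｜end▁of▁sentence｜>").strip()
print(text)
```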
All Input Parameters
{ "text": "Describe each stage of this image in detail", "top_p": 0.9, "image1": "https://raw.githubusercontent.com/deepseek-ai/DeepSeek-VL2/refs/heads/main/images/vqa_1.jpg", "temperature": 0.1, "max_new_tokens": 2048, "repetition_penalty": 1.1 }
Input Parameters
- text
- Input text.
- top_p
- If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
- image1 (required)
- First image
- image2
- Optional second image for multi-image image-to-text input
- image3
- Optional third image for multi-image image-to-text input
- temperature
- The value used to modulate the probabilities of the next token. Set the temperature to 0 for deterministic generation.
- max_new_tokens
- The maximum number of tokens to generate
- repetition_penalty
- Repetition penalty applied during decoding; values greater than 1 discourage repeating previously generated tokens (see the sketch after this list)
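For orientation, the sketch below shows how sampling parameters like these typically map onto a Hugging Face generate() call. It is illustrative only: the checkpoint name is a placeholder, and the real predictor additionally handles the hybrid vision encoder and image preprocessing, which are not shown here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the actual DeepSeek-VL2 predictor also wires in the
# hybrid vision encoder and image inputs, which this sketch omits.
model_id = "your-org/your-causal-lm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Describe each stage of this image in detail", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=True,          # sampling so that top_p / temperature take effect
    top_p=0.9,               # nucleus sampling threshold
    temperature=0.1,         # low temperature -> near-deterministic output
    max_new_tokens=2048,     # upper bound on newly generated tokens
    repetition_penalty=1.1,  # values > 1 discourage repeated tokens
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```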
Example Execution Logs
2024-12-29 15:48:34,853 - gradio_logger - INFO - ================================================================================
2024-12-29 15:48:34,853 - gradio_logger - INFO - <|User|>: <image> Describe each stage of this image in detail <|Assistant|>:
[{'role': '<|User|>', 'content': '<image>\nDescribe each stage of this image in detail', 'images': [<PIL.Image.Image image mode=RGB size=5014x2004 at 0x7251DA80DFD0>]}, {'role': '<|Assistant|>', 'content': ''}]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
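The log above shows the conversation structure the predictor builds before tokenization. A minimal sketch of assembling that structure with PIL is given below; the image path is a placeholder, and the downstream processor and model call are omitted because they do not appear in the log.

```python
from PIL import Image

# Role names, the "<image>" placeholder, and the field layout are copied from
# the logged conversation; "vqa_1.jpg" stands in for a local copy of the input.
image = Image.open("vqa_1.jpg").convert("RGB")

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\nDescribe each stage of this image in detail",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]
```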
Version Details
- Version ID
8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
- Version Created
- December 29, 2024