chenxwh/deepseek-vl2
About
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Example Output
Output
{"img_out":null,"text_out":"The figure illustrates the three-stage training process for a Vision-Language (VL) model using the DeepSeek LLM. Here's a detailed description of each stage:
Stage 1: Training VL Adaptor
Components Involved:
- DeepSeek LLM: This is the large language model that serves as the backbone for the entire process.
- Vision-Language Adaptor: This component is responsible for adapting the vision and language models to work together effectively.
- Hybrid Vision Encoder: This encoder processes visual data and converts it into a format suitable for the model. It uses two specific models, SAM-B and SigLIP-L, which are pre-trained on image-text pairs.
- Image-Text Pairs: These pairs are used to train the hybrid vision encoder. The encoder learns to understand and generate text based on visual inputs.
Process:
- The hybrid vision encoder is trained using image-text pairs. This training helps the encoder learn how to map images to corresponding text descriptions.
Stage 2: Joint VL Pre-training
Components Involved:
- DeepSeek LLM: Continues to be the central model.
- Vision-Language Adaptor: Continues to adapt the vision and language components.
- Hybrid Vision Encoder: Continues to process visual data.
- Interleaved VL + Pure Language Sequences: In this stage, both vision-language sequences and pure language sequences are interleaved. This means that the model is exposed to both types of data during training, allowing it to learn from a more diverse set of examples.
Process:
- The model is pre-trained using a combination of vision-language sequences and pure language sequences. This joint training helps the model develop a better understanding of both visual and textual information.
Stage 3: Supervised Finetuning
Components Involved:
- DeepSeek LLM: Continues to be the central model.
- Vision-Language Adaptor: Continues to refine the adaptation between vision and language.
- Hybrid Vision Encoder: Continues to process visual data.
- VL Chat Data + Pure Language Chat Data: In this final stage, the model is fine-tuned using both vision-language chat data and pure language chat data. This step focuses on improving the model's performance specifically in generating responses to chat queries.
Process:
- The model undergoes supervised finetuning with a focus on chat tasks. The finetuning process involves adjusting the model parameters to optimize its performance in generating coherent and contextually relevant responses to user queries.
Summary
The figure outlines a comprehensive approach to training a Vision-Language model using the DeepSeek LLM. Starting with training the VL adaptor and hybrid vision encoder on image-text pairs, followed by joint pre-training with interleaved sequences, and finally, supervised finetuning with chat data, the model is progressively refined to enhance its ability to handle complex vision-language tasks.<｜end▁of▁sentence｜>"}
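The prediction is returned as a JSON object with an img_out field (null here) and a text_out field that can end with the model's <｜end▁of▁sentence｜> marker. Below is a minimal sketch of reading such a result in Python; the field names and the special token come from the example above, while the output.json path is a placeholder for wherever the response was saved.

```python
import json

# "img_out" / "text_out" and the end-of-sentence marker are taken from the
# example output above; "output.json" is a hypothetical local copy of it.
with open("output.json") as f:
    result = json.load(f)

text = result["text_out"]
# Drop the trailing special token if the model emitted it (Python 3.9+).
text = text.removesuffix("<｜end▁of▁sentence｜>").strip()
print(text)
```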
All Input Parameters
{ "text": "Describe each stage of this image in detail", "top_p": 0.9, "image1": "https://raw.githubusercontent.com/deepseek-ai/DeepSeek-VL2/refs/heads/main/images/vqa_1.jpg", "temperature": 0.1, "max_new_tokens": 2048, "repetition_penalty": 1.1 }
Input Parameters
- text
- Input text.
- top_p
- If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
- image1 (required)
- First image
- image2
- Optional second image for multi-image image-to-text input
- image3
- Optional third image for multi-image image-to-text input
- temperature
- The value used to modulate the probabilities of the next token. Set the temperature to 0 for deterministic generation.
- max_new_tokens
- The maximum number of tokens to generate
- repetition_penalty
- Repetition penalty applied during decoding; values greater than 1 discourage repeating previously generated tokens (see the sketch after this list)
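For orientation, the sketch below shows how sampling parameters like these typically map onto a Hugging Face generate() call. It is illustrative only: the checkpoint name is a placeholder, and the real predictor additionally handles the hybrid vision encoder and image preprocessing, which are not shown here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the actual DeepSeek-VL2 predictor also wires in the
# hybrid vision encoder and image inputs, which this sketch omits.
model_id = "your-org/your-causal-lm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Describe each stage of this image in detail", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=True,          # sampling so that top_p / temperature take effect
    top_p=0.9,               # nucleus sampling threshold
    temperature=0.1,         # low temperature -> near-deterministic output
    max_new_tokens=2048,     # upper bound on newly generated tokens
    repetition_penalty=1.1,  # values > 1 discourage repeated tokens
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```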
Example Execution Logs
2024-12-29 15:48:34,853 - gradio_logger - INFO - ================================================================================
2024-12-29 15:48:34,853 - gradio_logger - INFO - <|User|>: <image> Describe each stage of this image in detail <|Assistant|>:
[{'role': '<|User|>', 'content': '<image>\nDescribe each stage of this image in detail', 'images': [<PIL.Image.Image image mode=RGB size=5014x2004 at 0x7251DA80DFD0>]}, {'role': '<|Assistant|>', 'content': ''}]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
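The log above shows the conversation structure the predictor builds before tokenization. A minimal sketch of assembling that structure with PIL is given below; the image path is a placeholder, and the downstream processor and model call are omitted because they do not appear in the log.

```python
from PIL import Image

# Role names, the "<image>" placeholder, and the field layout are copied from
# the logged conversation; "vqa_1.jpg" stands in for a local copy of the input.
image = Image.open("vqa_1.jpg").convert("RGB")

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\nDescribe each stage of this image in detail",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]
```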
Version Details
- Version ID
8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
- Version Created
- December 29, 2024