chenxwh/deepseek-vl2 📝🔢🖼️ → ❓

โ–ถ๏ธ 1.1K runs ๐Ÿ“… Dec 2024 โš™๏ธ Cog 0.9.23 ๐Ÿ”— GitHub ๐Ÿ“„ Paper โš–๏ธ License
image-to-text ocr visual-question-answering

About

Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Example Output

Output

{"img_out":null,"text_out":"The figure illustrates the three-stage training process for a Vision-Language (VL) model using the DeepSeek LLM. Here's a detailed description of each stage:

Stage 1: Training VL Adaptor

  • Components Involved:

    • DeepSeek LLM: This is the large language model that serves as the backbone for the entire process.
    • Vision-Language Adaptor: This component is responsible for adapting the vision and language models to work together effectively.
    • Hybrid Vision Encoder: This encoder processes visual data and converts it into a format suitable for the model. It uses two specific models, SAM-B and SigLIP-L, which are pre-trained on image-text pairs.
    • Image-Text Pairs: These pairs are used to train the hybrid vision encoder. The encoder learns to understand and generate text based on visual inputs.
  • Process:

    • The hybrid vision encoder is trained using image-text pairs. This training helps the encoder learn how to map images to corresponding text descriptions.

Stage 2: Joint VL Pre-training

  • Components Involved:

    • DeepSeek LLM: Continues to be the central model.
    • Vision-Language Adaptor: Continues to adapt the vision and language components.
    • Hybrid Vision Encoder: Continues to process visual data.
    • Interleaved VL + Pure Language Sequences: In this stage, both vision-language sequences and pure language sequences are interleaved. This means that the model is exposed to both types of data during training, allowing it to learn from a more diverse set of examples.
  • Process:

    • The model is pre-trained using a combination of vision-language sequences and pure language sequences. This joint training helps the model develop a better understanding of both visual and textual information.

Stage 3: Supervised Finetuning

  • Components Involved:

    • DeepSeek LLM: Continues to be the central model.
    • Vision-Language Adaptor: Continues to refine the adaptation between vision and language.
    • Hybrid Vision Encoder: Continues to process visual data.
    • VL Chat Data + Pure Language Chat Data: In this final stage, the model is fine-tuned using both vision-language chat data and pure language chat data. This step focuses on improving the model's performance specifically in generating responses to chat queries.
  • Process:

    • The model undergoes supervised finetuning with a focus on chat tasks. The finetuning process involves adjusting the model parameters to optimize its performance in generating coherent and contextually relevant responses to user queries.

Summary

The figure outlines a comprehensive approach to training a Vision-Language model using the DeepSeek LLM. Starting with training the VL adaptor and hybrid vision encoder on image-text pairs, followed by joint pre-training with interleaved sequences, and finally, supervised finetuning with chat data, the model is progressively refined to enhance its ability to handle complex vision-language tasks.<｜end▁of▁sentence｜>"}
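
Read as a recipe, the quoted description amounts to a three-stage schedule. The sketch below is only an illustrative reconstruction in Python: the fit helper is hypothetical, and which modules are frozen at each stage is an assumption the quoted text does not state.

def fit(trainable, frozen, data):
    # Hypothetical stand-in for one training run: real code would loop over
    # batches, compute a next-token loss, and update only `trainable` modules.
    print(f"train {trainable} (frozen: {frozen}) on {data}")

# Stage 1: train the VL adaptor on image-text pairs
# (freezing the LLM and encoder here is an assumption, not stated above).
fit(trainable=["vision_language_adaptor"],
    frozen=["deepseek_llm", "hybrid_vision_encoder"],
    data="image-text pairs")

# Stage 2: joint VL pre-training on interleaved VL + pure-language sequences.
fit(trainable=["deepseek_llm", "vision_language_adaptor"],
    frozen=["hybrid_vision_encoder"],
    data="interleaved VL + pure language sequences")

# Stage 3: supervised finetuning on VL chat + pure-language chat data.
fit(trainable=["deepseek_llm", "vision_language_adaptor", "hybrid_vision_encoder"],
    frozen=[],
    data="VL chat data + pure language chat data")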

Performance Metrics

139.02s Prediction Time
224.54s Total Time
All Input Parameters
{
  "text": "Describe each stage of this image in detail",
  "top_p": 0.9,
  "image1": "https://raw.githubusercontent.com/deepseek-ai/DeepSeek-VL2/refs/heads/main/images/vqa_1.jpg",
  "temperature": 0.1,
  "max_new_tokens": 2048,
  "repetition_penalty": 1.1
}
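
These exact inputs can be reproduced with the Replicate Python client; a minimal sketch, assuming the output has the {"img_out", "text_out"} shape shown in the example above:

import replicate  # requires REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb",
    input={
        "text": "Describe each stage of this image in detail",
        "image1": "https://raw.githubusercontent.com/deepseek-ai/DeepSeek-VL2/refs/heads/main/images/vqa_1.jpg",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_new_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)
print(output["text_out"])  # assumes the output object seen in Example Output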
Input Parameters
text Type: string Default: Describe this image.
Input text.
top_p Type: number Default: 0.9
If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
image1 (required) Type: string
First image
image2 Type: string
Optional second image for multi-image image-to-text.
image3 Type: string
Optional third image for multi-image image-to-text.
temperature Type: number Default: 0.1
The value used to modulate the probabilities of the next token. Set the temperature to 0 for deterministic generation.
max_new_tokens Type: integer Default: 2048 Range: 0 - 4096
The maximum number of tokens to generate.
repetition_penalty Type: number Default: 1.1 Range: 0 - 2
Repetition penalty (how this knob combines with top_p and temperature is sketched below).
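
How top_p, temperature, and repetition_penalty interact is not documented on this page; the numpy sketch below follows common Hugging Face-style semantics and is illustrative, not this model's actual decoding code.

import numpy as np

def sample_next_token(logits, past_tokens, temperature=0.1, top_p=0.9,
                      repetition_penalty=1.1, rng=None):
    # Illustrative single decoding step; HF-style semantics assumed throughout.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Repetition penalty: dampen tokens that already appeared.
    for t in set(past_tokens):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 else logits[t] * repetition_penalty
    # temperature == 0 is treated as greedy (deterministic) decoding.
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))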
Output Schema
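The schema itself was not captured here; judging from the example output above, a prediction returns an object shaped roughly like the following (inferred, not the published schema):

{
  "img_out": "string | null",
  "text_out": "string"
}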
Example Execution Logs
2024-12-29 15:48:34,853 - gradio_logger - INFO - ================================================================================
2024-12-29 15:48:34,853 - gradio_logger - INFO - <|User|>: <image>
Describe each stage of this image in detail
<|Assistant|>:
[{'role': '<|User|>', 'content': '<image>\nDescribe each stage of this image in detail', 'images': [<PIL.Image.Image image mode=RGB size=5014x2004 at 0x7251DA80DFD0>]}, {'role': '<|Assistant|>', 'content': ''}]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
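
The log above shows the conversation structure the predictor builds before tokenization. A hedged reconstruction follows; the image path is a placeholder, and the surrounding model-loading and generation calls are omitted:

from PIL import Image

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\nDescribe each stage of this image in detail",
        "images": [Image.open("vqa_1.jpg").convert("RGB")],  # placeholder path
    },
    {"role": "<|Assistant|>", "content": ""},  # empty slot the model fills in
]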
Version Details
Version ID
8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
Version Created
December 29, 2024
Run on Replicate →