deepseek-ai/deepseek-vl2 🖼️🔢📝 → 📝

▶️ 70.5K runs 📅 Feb 2025 ⚙️ Cog 0.13.7 🔗 GitHub 📄 Paper ⚖️ License
image-to-text ocr visual-question-answering visual-understanding

About

DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves on its predecessor, DeepSeek-VL.

Example Output

Prompt:

"Describe each stage of this <image> in detail"

Output

The figure illustrates a three-stage process for training and fine-tuning a Vision-Language (VL) model using the DeepSeek LLM framework. Here's a detailed description of each stage:

Stage 1: Training VL Adaptor

  • Components Involved:

    • DeepSeek LLM: This is the foundational language model used throughout the stages.
    • Vision-Language Adaptor: This component is responsible for adapting the language model to handle vision-language tasks.
    • Hybrid Vision Encoder: This encoder processes visual data and combines it with textual information.
    • SAM-B and SigLIP-L: These are specific models or components within the hybrid vision encoder that contribute to processing image-text pairs.
  • Process:

    • The DeepSeek LLM is used as the base model.
    • The Vision-Language Adaptor is trained on image-text pairs, which involves aligning visual and textual information.
    • The Hybrid Vision Encoder processes these image-text pairs, integrating both visual and textual features.

Stage 2: Joint VL Pre-training

  • Components Involved:

    • DeepSeek LLM: Continues to be the core language model.
    • Vision-Language Adaptor: Further refined through joint pre-training.
    • Hybrid Vision Encoder: Enhanced to better handle interleaved vision and language sequences.
    • SAM-B and SigLIP-L: Continue to play roles in encoding visual and textual data.
  • Process:

    • The model undergoes joint pre-training using interleaved vision and language sequences.
    • This step helps the model learn to effectively combine and process both types of data simultaneously.
    • The Vision-Language Adaptor and Hybrid Vision Encoder are further optimized during this phase.

Stage 3: Supervised Finetuning

  • Components Involved:

    • DeepSeek LLM: Now fully integrated into the VL system.
    • Vision-Language Adaptor: Fully adapted and ready for specific tasks.
    • Hybrid Vision Encoder: Finalized and capable of handling complex VL tasks.
    • SAM-B and SigLIP-L: Continue their roles in encoding and processing data.
  • Process:

    • The model is fine-tuned using VL chat data and pure language chat data.
    • This supervised finetuning phase refines the model's performance on specific VL tasks, such as conversational understanding and generation.
    • The Vision-Language Adaptor and Hybrid Vision Encoder are fine-tuned to ensure they work seamlessly together for the desired outcomes.

Overall, this three-stage process leverages the strengths of the DeepSeek LLM and specialized adaptors to create a robust Vision-Language model capable of handling various tasks involving both visual and textual data.
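The staged recipe described in the output above amounts to a freeze/unfreeze schedule over three components (vision encoder, adaptor, LLM). The Python sketch below is illustrative only: the module classes, dimensions, and optimizer are placeholder assumptions, not DeepSeek's actual training code.

import torch
import torch.nn as nn

class VLModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins for the components named in the figure (placeholders only).
        self.vision_encoder = nn.Linear(1024, 4096)  # hybrid SAM-B + SigLIP-L encoder
        self.vl_adaptor = nn.Linear(4096, 4096)      # vision-language adaptor
        self.llm = nn.Linear(4096, 4096)             # DeepSeek LLM backbone

    def forward(self, pixels):
        return self.llm(self.vl_adaptor(self.vision_encoder(pixels)))

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = VLModel()

# Stage 1: train only the VL adaptor on image-text pairs.
set_trainable(model.vision_encoder, False)
set_trainable(model.llm, False)
set_trainable(model.vl_adaptor, True)

# Stage 2: joint VL pre-training on interleaved vision-language sequences;
# the adaptor and encoder keep training and the LLM is unfrozen.
set_trainable(model.vision_encoder, True)
set_trainable(model.llm, True)

# Stage 3: supervised finetuning on VL chat data plus pure language chat data
# (all components trainable; only the data mixture changes).
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)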

Performance Metrics

40.04s Prediction Time
40.05s Total Time
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/MTtsBStHRqLDgNZMkt0J7PptoJ3lseSUNcGaDkG230ttNJlT/workflow.png",
  "top_p": 0.9,
  "prompt": "Describe each stage of this <image> in detail",
  "temperature": 0.1,
  "max_length_tokens": 2048,
  "repetition_penalty": 1.1
}
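The example prediction above can be reproduced with the Replicate Python client. This is a minimal sketch, assuming the replicate package is installed and REPLICATE_API_TOKEN is set in your environment; depending on your client version you may need to append the version ID listed under Version Details.

import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2",
    input={
        "image": "https://replicate.delivery/pbxt/MTtsBStHRqLDgNZMkt0J7PptoJ3lseSUNcGaDkG230ttNJlT/workflow.png",
        "prompt": "Describe each stage of this <image> in detail",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)

# For this model the output is a single string (see Output Schema below).
print(output)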
Input Parameters
image (required) Type: string
Input image file
top_p Type: number Default: 0.9 Range: 0 - 1
Top-p sampling parameter
prompt (required) Type: string
Text prompt to guide the model
temperature Type: number Default: 0.1 Range: 0 - 1
Temperature for sampling
max_length_tokens Type: integer Default: 2048 Range: 0 - 4096
Maximum number of tokens to generate
repetition_penalty Type: number Default: 1.1 Range: 0 - 2
Repetition penalty
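Only image and prompt are required; the sampling parameters fall back to the defaults listed above. A small helper like the sketch below (the function name is illustrative, not part of the model's API) can assemble an input dict and enforce the documented ranges before calling the model.

def build_input(image, prompt, top_p=0.9, temperature=0.1,
                max_length_tokens=2048, repetition_penalty=1.1):
    # Assemble an input dict using the documented defaults and ranges.
    assert 0 <= top_p <= 1, "top_p must be in [0, 1]"
    assert 0 <= temperature <= 1, "temperature must be in [0, 1]"
    assert 0 <= max_length_tokens <= 4096, "max_length_tokens must be in [0, 4096]"
    assert 0 <= repetition_penalty <= 2, "repetition_penalty must be in [0, 2]"
    return {
        "image": image,
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_length_tokens": max_length_tokens,
        "repetition_penalty": repetition_penalty,
    }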
Output Schema

Output

Type: string

Version Details
Version ID
e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
Version Created
February 11, 2025
Run on Replicate →
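To pin API calls to this exact build, append the version ID above to the model reference, as in this Python sketch (same assumptions about the replicate client as earlier):

MODEL_REF = (
    "deepseek-ai/deepseek-vl2:"
    "e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200"
)
# output = replicate.run(MODEL_REF, input={...})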