deepseek-ai/deepseek-vl2 🖼️🔢📝 → 📝

▶️ 70.5K runs 📅 Feb 2025 ⚙️ Cog 0.13.7 🔗 GitHub 📄 Paper ⚖️ License
image-to-text ocr visual-question-answering visual-understanding

About

DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves on its predecessor, DeepSeek-VL.

Example Output

Prompt:

"Describe each stage of this <image> in detail"

Output

The figure illustrates a three-stage process for training and fine-tuning a Vision-Language (VL) model using the DeepSeek LLM framework. Here's a detailed description of each stage:

Stage 1: Training VL Adaptor

  • Components Involved:

    • DeepSeek LLM: This is the foundational language model used throughout the stages.
    • Vision-Language Adaptor: This component is responsible for adapting the language model to handle vision-language tasks.
    • Hybrid Vision Encoder: This encoder processes visual data and combines it with textual information.
    • SAM-B and SigLIP-L: These are specific models or components within the hybrid vision encoder that contribute to processing image-text pairs.
  • Process:

    • The DeepSeek LLM is used as the base model.
    • The Vision-Language Adaptor is trained on image-text pairs, which involves aligning visual and textual information.
    • The Hybrid Vision Encoder processes these image-text pairs, integrating both visual and textual features.

Stage 2: Joint VL Pre-training

  • Components Involved:

    • DeepSeek LLM: Continues to be the core language model.
    • Vision-Language Adaptor: Further refined through joint pre-training.
    • Hybrid Vision Encoder: Enhanced to better handle interleaved vision and language sequences.
    • SAM-B and SigLIP-L: Continue to play roles in encoding visual and textual data.
  • Process:

    • The model undergoes joint pre-training using interleaved vision and language sequences.
    • This step helps the model learn to effectively combine and process both types of data simultaneously.
    • The Vision-Language Adaptor and Hybrid Vision Encoder are further optimized during this phase.

Stage 3: Supervised Finetuning

  • Components Involved:

    • DeepSeek LLM: Now fully integrated into the VL system.
    • Vision-Language Adaptor: Fully adapted and ready for specific tasks.
    • Hybrid Vision Encoder: Finalized and capable of handling complex VL tasks.
    • SAM-B and SigLIP-L: Continue their roles in encoding and processing data.
  • Process:

    • The model is fine-tuned using VL chat data and pure language chat data.
    • This supervised finetuning phase refines the model's performance on specific VL tasks, such as conversational understanding and generation.
    • The Vision-Language Adaptor and Hybrid Vision Encoder are fine-tuned to ensure they work seamlessly together for the desired outcomes.

Overall, this three-stage process leverages the strengths of the DeepSeek LLM and specialized adaptors to create a robust Vision-Language model capable of handling various tasks involving both visual and textual data.
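The staged recipe described in the output above amounts to a freeze/unfreeze schedule over three components (vision encoder, adaptor, LLM). The Python sketch below is illustrative only: the module classes, dimensions, and optimizer are placeholder assumptions, not DeepSeek's actual training code.

import torch
import torch.nn as nn

class VLModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins for the components named in the figure (placeholders only).
        self.vision_encoder = nn.Linear(1024, 4096)  # hybrid SAM-B + SigLIP-L encoder
        self.vl_adaptor = nn.Linear(4096, 4096)      # vision-language adaptor
        self.llm = nn.Linear(4096, 4096)             # DeepSeek LLM backbone

    def forward(self, pixels):
        return self.llm(self.vl_adaptor(self.vision_encoder(pixels)))

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = VLModel()

# Stage 1: train only the VL adaptor on image-text pairs.
set_trainable(model.vision_encoder, False)
set_trainable(model.llm, False)
set_trainable(model.vl_adaptor, True)

# Stage 2: joint VL pre-training on interleaved vision-language sequences;
# the adaptor and encoder keep training and the LLM is unfrozen.
set_trainable(model.vision_encoder, True)
set_trainable(model.llm, True)

# Stage 3: supervised finetuning on VL chat data plus pure language chat data
# (all components trainable; only the data mixture changes).
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)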

Performance Metrics

40.04s Prediction Time
40.05s Total Time
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/MTtsBStHRqLDgNZMkt0J7PptoJ3lseSUNcGaDkG230ttNJlT/workflow.png",
  "top_p": 0.9,
  "prompt": "Describe each stage of this <image> in detail",
  "temperature": 0.1,
  "max_length_tokens": 2048,
  "repetition_penalty": 1.1
}
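The example prediction above can be reproduced with the Replicate Python client. This is a minimal sketch, assuming the replicate package is installed and REPLICATE_API_TOKEN is set in your environment; depending on your client version you may need to append the version ID listed under Version Details.

import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2",
    input={
        "image": "https://replicate.delivery/pbxt/MTtsBStHRqLDgNZMkt0J7PptoJ3lseSUNcGaDkG230ttNJlT/workflow.png",
        "prompt": "Describe each stage of this <image> in detail",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)

# For this model the output is a single string (see Output Schema below).
print(output)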
Input Parameters
image (required) Type: string
Input image file
top_p Type: number Default: 0.9 Range: 0 - 1
Top-p sampling parameter
prompt (required) Type: string
Text prompt to guide the model
temperature Type: number Default: 0.1 Range: 0 - 1
Temperature for sampling
max_length_tokens Type: integer Default: 2048 Range: 0 - 4096
Maximum number of tokens to generate
repetition_penalty Type: number Default: 1.1 Range: 0 - 2
Repetition penalty
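Only image and prompt are required; the sampling parameters fall back to the defaults listed above. A small helper like the sketch below (the function name is illustrative, not part of the model's API) can assemble an input dict and enforce the documented ranges before calling the model.

def build_input(image, prompt, top_p=0.9, temperature=0.1,
                max_length_tokens=2048, repetition_penalty=1.1):
    # Assemble an input dict using the documented defaults and ranges.
    assert 0 <= top_p <= 1, "top_p must be in [0, 1]"
    assert 0 <= temperature <= 1, "temperature must be in [0, 1]"
    assert 0 <= max_length_tokens <= 4096, "max_length_tokens must be in [0, 4096]"
    assert 0 <= repetition_penalty <= 2, "repetition_penalty must be in [0, 2]"
    return {
        "image": image,
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_length_tokens": max_length_tokens,
        "repetition_penalty": repetition_penalty,
    }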
Output Schema

Output

Type: string

Version Details
Version ID
e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
Version Created
February 11, 2025
Run on Replicate →
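To pin API calls to this exact build, append the version ID above to the model reference, as in this Python sketch (same assumptions about the replicate client as earlier):

MODEL_REF = (
    "deepseek-ai/deepseek-vl2:"
    "e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200"
)
# output = replicate.run(MODEL_REF, input={...})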