deepseek-ai/deepseek-vl2-small 🖼️🔢📝 → 📝
About
DeepSeek-VL2-small is part of DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL.

Example Output
Prompt:
"Describe each stage of this <image> in detail"
Output
The diagram illustrates a three-stage process for training and fine-tuning a model using a combination of vision-language (VL) adaptation, pre-training, and supervised finetuning. Here's a detailed breakdown of each stage:
Stage 1: Training VL Adaptor
Components Involved:
- DeepSeek LLM: This is the large language model that will be adapted to work with visual data.
- Vision-Language Adaptor: A module designed to bridge the gap between visual inputs and the language model.
- Hybrid Vision Encoder: This encoder processes visual information from images and text pairs. It consists of two components:
  - SAM-B: Likely a component related to spatial attention mechanisms.
  - SigLIP-L: Possibly a component related to signal processing or lip reading.
Process:
- The hybrid vision encoder takes image-text pairs as input.
- The vision-language adaptor then uses these processed visual features to interact with the DeepSeek LLM.
Stage 2: Joint VL Pre-training
Components Involved:
- Same as Stage 1, but now the system is being trained on interleaved VL + pure language sequences.
Process:
- The hybrid vision encoder continues to process visual information.
- The vision-language adaptor integrates these visual features with the DeepSeek LLM.
- The model is pre-trained on a dataset that includes both visual-linguistic and purely linguistic sequences.
Stage 3: Supervised Finetuning
Components Involved:
- Same as Stage 1 and Stage 2, but now the system is being fine-tuned on VL chat data + pure language chat data.
Process:
- The hybrid vision encoder processes the new dataset.
- The vision-language adaptor combines the visual features with the DeepSeek LLM.
- The model undergoes supervised finetuning to improve its performance on specific tasks, such as understanding and generating responses in VL chat contexts.
Summary
The overall process involves:
- Training the VL Adaptor: Using image-text pairs to train the vision-language adaptor and hybrid vision encoder.
- Joint Pre-training: Integrating visual and linguistic information to pre-train the model on a mixed dataset.
- Supervised Finetuning: Fine-tuning the model on specialized VL chat data to enhance its capabilities in handling conversational tasks.
Performance Metrics
- Prediction Time: 20.59s
- Total Time: 106.44s
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/MTtq4AbrRWL05upjmLYEI1JNyjVYYZv7CDuZ0PgzDtfMegYO/workflow.png", "top_p": 0.9, "prompt": "Describe each stage of this <image> in detail", "temperature": 0.1, "max_length_tokens": 2048, "repetition_penalty": 1.1 }
Input Parameters
- image (required): Input image file
- top_p: Top-p sampling parameter
- prompt (required): Text prompt to guide the model
- temperature: Temperature for sampling
- max_length_tokens: Maximum number of tokens to generate
- repetition_penalty: Repetition penalty
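As a rough illustration, the parameters above can be supplied through the Replicate Python client. This is a minimal sketch, assuming the replicate package is installed and a REPLICATE_API_TOKEN environment variable is set; the version hash is the one listed under Version Details below, and the exact shape of the returned output depends on the model's output schema.

```python
# Minimal sketch: run deepseek-vl2-small on Replicate with the example inputs above.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4",
    input={
        "image": "https://replicate.delivery/pbxt/MTtq4AbrRWL05upjmLYEI1JNyjVYYZv7CDuZ0PgzDtfMegYO/workflow.png",
        "prompt": "Describe each stage of this <image> in detail",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)

# Depending on the output schema, `output` may be a plain string or an
# iterable of text chunks, so join defensively before printing.
print(output if isinstance(output, str) else "".join(output))
```

A low temperature (0.1 here) keeps the generated description close to deterministic, while max_length_tokens caps how long the walkthrough can get.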
Example Execution Logs
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
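For context, this warning comes from the underlying Hugging Face transformers tokenizer and does not affect the output. The sketch below is illustrative only (it uses a small public stand-in LLaMA tokenizer checkpoint rather than the model's own processor) and contrasts the two code paths the message refers to.

```python
# Illustrative only: the fast-tokenizer behaviour the log message above refers to.
# Uses a public stand-in checkpoint, not deepseek-vl2-small's own preprocessing code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

texts = ["Describe each stage of this <image> in detail", "Hello"]

# Faster path: __call__ tokenizes and pads the whole batch in one step.
batch = tokenizer(texts, padding=True)

# Slower path (the pattern that triggers the warning): encode each text, then pad separately.
encodings = [tokenizer(t) for t in texts]
padded = tokenizer.pad(encodings, padding=True)
```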
Version Details
- Version ID: 965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
- Version Created: February 11, 2025