deepseek-ai/deepseek-vl2-small 🖼️🔢📝 → 📝
About
DeepSeek-VL2-small is part of DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL.

Example Output
Prompt:
"Describe each stage of this <image> in detail"
Output
The diagram illustrates a three-stage process for training and fine-tuning a model using a combination of vision-language (VL) adaptation, pre-training, and supervised finetuning. Here's a detailed breakdown of each stage:
Stage 1: Training VL Adaptor
Components Involved:
- DeepSeek LLM: This is the large language model that will be adapted to work with visual data.
- Vision-Language Adaptor: A module designed to bridge the gap between visual inputs and the language model.
- Hybrid Vision Encoder: This encoder processes visual information from images and text pairs. It consists of two components:
  - SAM-B: Likely a component related to spatial attention mechanisms.
  - SigLIP-L: Possibly a component related to signal processing or lip reading.
Process:
- The hybrid vision encoder takes image-text pairs as input.
- The vision-language adaptor then uses these processed visual features to interact with the DeepSeek LLM.
Stage 2: Joint VL Pre-training
Components Involved:
- Same as Stage 1, but now the system is being trained on interleaved VL + pure language sequences.
Process:
- The hybrid vision encoder continues to process visual information.
- The vision-language adaptor integrates these visual features with the DeepSeek LLM.
- The model is pre-trained on a dataset that includes both visual-linguistic and purely linguistic sequences.
Stage 3: Supervised Finetuning
Components Involved:
- Same as Stage 1 and Stage 2, but now the system is being fine-tuned on VL chat data + pure language chat data.
Process:
- The hybrid vision encoder processes the new dataset.
- The vision-language adaptor combines the visual features with the DeepSeek LLM.
- The model undergoes supervised finetuning to improve its performance on specific tasks, such as understanding and generating responses in VL chat contexts.
Summary
The overall process involves:
- Training the VL Adaptor: Using image-text pairs to train the vision-language adaptor and hybrid vision encoder.
- Joint Pre-training: Integrating visual and linguistic information to pre-train the model on a mixed dataset.
- Supervised Finetuning: Fine-tuning the model on specialized VL chat data to enhance its capabilities in handling conversational tasks.
Performance Metrics
- Prediction Time: 20.59s
- Total Time: 106.44s
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/MTtq4AbrRWL05upjmLYEI1JNyjVYYZv7CDuZ0PgzDtfMegYO/workflow.png", "top_p": 0.9, "prompt": "Describe each stage of this <image> in detail", "temperature": 0.1, "max_length_tokens": 2048, "repetition_penalty": 1.1 }
Input Parameters
- image (required): Input image file
- top_p: Top-p sampling parameter
- prompt (required): Text prompt to guide the model
- temperature: Temperature for sampling
- max_length_tokens: Maximum number of tokens to generate
- repetition_penalty: Repetition penalty
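As a rough illustration, the parameters above can be supplied through the Replicate Python client. This is a minimal sketch, assuming the replicate package is installed and a REPLICATE_API_TOKEN environment variable is set; the version hash is the one listed under Version Details below, and the exact shape of the returned output depends on the model's output schema.

```python
# Minimal sketch: run deepseek-vl2-small on Replicate with the example inputs above.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4",
    input={
        "image": "https://replicate.delivery/pbxt/MTtq4AbrRWL05upjmLYEI1JNyjVYYZv7CDuZ0PgzDtfMegYO/workflow.png",
        "prompt": "Describe each stage of this <image> in detail",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)

# Depending on the output schema, `output` may be a plain string or an
# iterable of text chunks, so join defensively before printing.
print(output if isinstance(output, str) else "".join(output))
```

A low temperature (0.1 here) keeps the generated description close to deterministic, while max_length_tokens caps how long the walkthrough can get.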
Example Execution Logs
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
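For context, this warning comes from the underlying Hugging Face transformers tokenizer and does not affect the output. The sketch below is illustrative only (it uses a small public stand-in LLaMA tokenizer checkpoint rather than the model's own processor) and contrasts the two code paths the message refers to.

```python
# Illustrative only: the fast-tokenizer behaviour the log message above refers to.
# Uses a public stand-in checkpoint, not deepseek-vl2-small's own preprocessing code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

texts = ["Describe each stage of this <image> in detail", "Hello"]

# Faster path: __call__ tokenizes and pads the whole batch in one step.
batch = tokenizer(texts, padding=True)

# Slower path (the pattern that triggers the warning): encode each text, then pad separately.
encodings = [tokenizer(t) for t in texts]
padded = tokenizer.pad(encodings, padding=True)
```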
Version Details
- Version ID: 965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
- Version Created: February 11, 2025