deepseek-ai/deepseek-vl2-small 🖼️🔢📝 → 📝

▶️ 1.4K runs 📅 Feb 2025 ⚙️ Cog 0.13.7 🔗 GitHub 📄 Paper ⚖️ License
image-analysis image-captioning image-to-text ocr visual-understanding

About

DeepSeek-VL2-small is part of DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) vision-language models that significantly improves upon its predecessor, DeepSeek-VL.

Example Output

Prompt:

"Describe each stage of this <image> in detail"

Output

The diagram illustrates a three-stage process for training and fine-tuning a model using a combination of vision-language (VL) adaptation, pre-training, and supervised finetuning. Here's a detailed breakdown of each stage:

Stage 1: Training VL Adaptor

  • Components Involved:

    • DeepSeek LLM: This is the large language model that will be adapted to work with visual data.
    • Vision-Language Adaptor: A module designed to bridge the gap between visual inputs and the language model.
    • Hybrid Vision Encoder: This encoder processes visual information from images and text pairs. It consists of two components:
      • SAM-B: Likely a component related to spatial attention mechanisms.
      • SigLIP-L: Possibly a component related to signal processing or lip reading.
  • Process:

    • The hybrid vision encoder takes image-text pairs as input.
    • The vision-language adaptor then uses these processed visual features to interact with the DeepSeek LLM.

Stage 2: Joint VL Pre-training

  • Components Involved:

    • Same as Stage 1, but now the system is being trained on interleaved VL + pure language sequences.
  • Process:

    • The hybrid vision encoder continues to process visual information.
    • The vision-language adaptor integrates these visual features with the DeepSeek LLM.
    • The model is pre-trained on a dataset that includes both visual-linguistic and purely linguistic sequences.

Stage 3: Supervised Finetuning

  • Components Involved:

    • Same as Stage 1 and Stage 2, but now the system is being fine-tuned on VL chat data + pure language chat data.
  • Process:

    • The hybrid vision encoder processes the new dataset.
    • The vision-language adaptor combines the visual features with the DeepSeek LLM.
    • The model undergoes supervised finetuning to improve its performance on specific tasks, such as understanding and generating responses in VL chat contexts.

Summary

The overall process involves:

  1. Training the VL Adaptor: Using image-text pairs to train the vision-language adaptor and hybrid vision encoder.
  2. Joint Pre-training: Integrating visual and linguistic information to pre-train the model on a mixed dataset.
  3. Supervised Finetuning: Fine-tuning the model on specialized VL chat data to enhance its capabilities in handling conversational tasks.
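
To make the staged recipe described above more concrete, here is a hedged, PyTorch-style sketch of which components would typically be trainable at each stage. The module names mirror the labels in the diagram; this is schematic pseudocode under those assumptions, not DeepSeek's actual training code.

# Schematic sketch of the three-stage recipe described above.
# Module names mirror the diagram; this is not DeepSeek's training code.
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, hybrid_vision_encoder, vl_adaptor, llm):
        super().__init__()
        self.hybrid_vision_encoder = hybrid_vision_encoder  # SAM-B + SigLIP-L in the diagram
        self.vl_adaptor = vl_adaptor                        # maps visual features into the LLM's embedding space
        self.llm = llm                                      # DeepSeek LLM

def set_trainable(module: nn.Module, flag: bool) -> None:
    for param in module.parameters():
        param.requires_grad = flag

def configure_stage(model: VisionLanguageModel, stage: int) -> None:
    if stage == 1:
        # Stage 1: train only the VL adaptor on image-text pairs; the LLM stays frozen.
        set_trainable(model.hybrid_vision_encoder, False)
        set_trainable(model.llm, False)
        set_trainable(model.vl_adaptor, True)
    elif stage == 2:
        # Stage 2: joint VL pre-training on interleaved VL + pure language sequences.
        set_trainable(model.vl_adaptor, True)
        set_trainable(model.llm, True)
    elif stage == 3:
        # Stage 3: supervised finetuning on VL chat data + pure language chat data.
        set_trainable(model.vl_adaptor, True)
        set_trainable(model.llm, True)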

Performance Metrics

20.59s Prediction Time
106.44s Total Time
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/MTtq4AbrRWL05upjmLYEI1JNyjVYYZv7CDuZ0PgzDtfMegYO/workflow.png",
  "top_p": 0.9,
  "prompt": "Describe each stage of this <image> in detail",
  "temperature": 0.1,
  "max_length_tokens": 2048,
  "repetition_penalty": 1.1
}
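
For reference, here is a minimal sketch of submitting the same prediction with the Replicate Python client. The model identifier, version hash, and input values are taken from this page; the client setup (pip install replicate, a REPLICATE_API_TOKEN in the environment) is assumed.

import replicate

# Same inputs as the example prediction above; the version hash is listed
# under "Version Details" below.
output = replicate.run(
    "deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4",
    input={
        "image": "https://replicate.delivery/pbxt/MTtq4AbrRWL05upjmLYEI1JNyjVYYZv7CDuZ0PgzDtfMegYO/workflow.png",
        "prompt": "Describe each stage of this <image> in detail",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)
print(output)  # the output schema is a single string
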
Input Parameters
image (required)
  Type: string
  Input image file

top_p
  Type: number | Default: 0.9 | Range: 0 - 1
  Top-p sampling parameter

prompt (required)
  Type: string
  Text prompt to guide the model

temperature
  Type: number | Default: 0.1 | Range: 0 - 1
  Temperature for sampling

max_length_tokens
  Type: integer | Default: 2048 | Range: 0 - 4096
  Maximum number of tokens to generate

repetition_penalty
  Type: number | Default: 1.1 | Range: 0 - 2
  Repetition penalty
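
The predictor code behind this Cog wrapper is not shown on this page, so the following is a hypothetical sketch only of how sampling parameters like these are commonly passed to a Hugging Face-style generate() call. The function and variable names are placeholders, image handling is omitted, and this is not the model's actual implementation.

# Hypothetical sketch: how top_p, temperature, max_length_tokens and
# repetition_penalty commonly map onto Hugging Face-style generation kwargs.
# Placeholder code for illustration; not this model's predictor.
def generate_reply(model, tokenizer, prompt, top_p=0.9, temperature=0.1,
                   max_length_tokens=2048, repetition_penalty=1.1):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=temperature > 0,          # greedy decoding when temperature is 0
        temperature=temperature,
        top_p=top_p,                        # nucleus sampling threshold
        max_new_tokens=max_length_tokens,   # cap on newly generated tokens
        repetition_penalty=repetition_penalty,
    )
    # Decode only the tokens generated after the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
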
Output Schema

Output

Type: string

Example Execution Logs
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
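
This log line is a routine notice from the Transformers library rather than an error: with a fast (Rust-backed) tokenizer, a single call to the tokenizer both encodes and pads. A small illustrative sketch follows; the Hugging Face repo id is an assumption, not taken from this page.

from transformers import AutoTokenizer

# Illustrative only; repo id assumed, and a pad token is set if the checkpoint lacks one.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-vl2-small")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

texts = ["Describe each stage of this <image> in detail", "Hello"]

# Pattern the notice discourages: encode each text, then pad in a second step.
encoded = [tokenizer.encode(t) for t in texts]
slow_batch = tokenizer.pad({"input_ids": encoded}, return_tensors="pt")

# Preferred with a fast tokenizer: one __call__ tokenizes and pads in a single pass.
fast_batch = tokenizer(texts, padding=True, return_tensors="pt")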
Version Details
Version ID
965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
Version Created
February 11, 2025
Run on Replicate →