deepseek-ai/deepseek-vl-7b-base 🖼️📝🔢 → 📝

▶️ 4.7K runs 📅 Mar 2024 ⚙️ Cog 0.9.4 🔗 GitHub 📄 Paper ⚖️ License
image-captioning image-to-text visual-understanding

About

DeepSeek-VL: An open-source Vision-Language Model designed for real-world vision and language understanding applications

Example Output

Prompt:

"Describe each stage of this image"

Output

The image depicts a three-stage process for training a vision-language model.

  1. Stage 1: Training VL Adapter: In this stage, a vision-language adapter is trained using supervised fine-tuning. The adapter is trained on image-text pairs and pure language sequences.
  2. Stage 2: Joint VL Pre-training: In this stage, a joint vision-language model is pre-trained using self-supervised learning. The model is trained on image-text pairs and pure language sequences.
  3. Stage 3: Supervised Fine-tuning: In this stage, the model is fine-tuned on supervised tasks using image-text pairs and pure language sequences.

The model is trained using a hybrid vision-language adapter, which combines a vision-language adapter with a language model. The model is trained on a variety of tasks, including image captioning, visual question answering, and visual reasoning. The model is able to understand the visual content of an image and generate a natural language description or answer.

Performance Metrics

8.43s Prediction Time
82.87s Total Time
(Total time on Replicate typically includes queueing and model boot on top of the prediction itself, which accounts for the gap between the two numbers.)
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KYKAXfcjSZ7uEkwQoYG4SFXRJpbGcdKhWow4Ul0y4Qkjw6wW/training_pipelines.png",
  "prompt": "Describe each stage of this image",
  "max_new_tokens": 512
}
Input Parameters
image (required) Type: string
Input image
prompt Type: string Default: "Describe this image"
Input prompt
max_new_tokens Type: integer Default: 512
Maximum number of tokens to generate
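
For reference, a minimal sketch of calling this model with the Replicate Python client, using the example inputs shown above. It assumes the replicate package is installed and REPLICATE_API_TOKEN is set in the environment; the version ID is the one listed under Version Details below.

import replicate

# Pin the exact version listed under Version Details below.
model = (
    "deepseek-ai/deepseek-vl-7b-base:"
    "d1823e6f68cd3d57f2d315b9357dfa85f53817120ae0de8d2b95fbc8e93a1385"
)

# Inputs mirror the schema above: image is required;
# prompt and max_new_tokens fall back to their defaults if omitted.
output = replicate.run(
    model,
    input={
        "image": "https://replicate.delivery/pbxt/KYKAXfcjSZ7uEkwQoYG4SFXRJpbGcdKhWow4Ul0y4Qkjw6wW/training_pipelines.png",
        "prompt": "Describe each stage of this image",
        "max_new_tokens": 512,
    },
)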
Output Schema

Output

Type: array
Items Type: string
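
The output is an array of strings. Text-generating models on Replicate typically stream the result as chunks, so a common pattern is to concatenate them into the full text; continuing the sketch above:

# Join the streamed string chunks into the complete generated text.
text = "".join(output)
print(text)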

Version Details
Version ID
d1823e6f68cd3d57f2d315b9357dfa85f53817120ae0de8d2b95fbc8e93a1385
Version Created
March 11, 2024
Run on Replicate →