deepseek-ai/deepseek-vl-7b-base 🖼️📝🔢 → 📝

▶️ 4.7K runs 📅 Mar 2024 ⚙️ Cog 0.9.4 🔗 GitHub 📄 Paper ⚖️ License
image-captioning image-to-text visual-understanding

About

DeepSeek-VL: An open-source Vision-Language Model designed for real-world vision and language understanding applications

Example Output

Prompt:

"Describe each stage of this image"

Output

The image depicts a three-stage process for training a vision-language model.

  1. Stage 1: Training VL Adapter: In this stage, a vision-language adapter is trained using supervised fine-tuning. The adapter is trained on image-text pairs and pure language sequences.
  2. Stage 2: Joint VL Pre-training: In this stage, a joint vision-language model is pre-trained using self-supervised learning. The model is trained on image-text pairs and pure language sequences.
  3. Stage 3: Supervised Fine-tuning: In this stage, the model is fine-tuned on supervised tasks using image-text pairs and pure language sequences.

The model is trained using a hybrid vision-language adapter, which combines a vision-language adapter with a language model. The model is trained on a variety of tasks, including image captioning, visual question answering, and visual reasoning. The model is able to understand the visual content of an image and generate a natural language description or answer.

Performance Metrics

8.43s Prediction Time
82.87s Total Time
(Total time on Replicate typically includes queueing and model boot on top of the prediction itself, which accounts for the gap between the two numbers.)
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KYKAXfcjSZ7uEkwQoYG4SFXRJpbGcdKhWow4Ul0y4Qkjw6wW/training_pipelines.png",
  "prompt": "Describe each stage of this image",
  "max_new_tokens": 512
}
Input Parameters
image (required) Type: string
Input image
prompt Type: string Default: "Describe this image"
Input prompt
max_new_tokens Type: integer Default: 512
Maximum number of tokens to generate
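
For reference, a minimal sketch of calling this model with the Replicate Python client, using the example inputs shown above. It assumes the replicate package is installed and REPLICATE_API_TOKEN is set in the environment; the version ID is the one listed under Version Details below.

import replicate

# Pin the exact version listed under Version Details below.
model = (
    "deepseek-ai/deepseek-vl-7b-base:"
    "d1823e6f68cd3d57f2d315b9357dfa85f53817120ae0de8d2b95fbc8e93a1385"
)

# Inputs mirror the schema above: image is required;
# prompt and max_new_tokens fall back to their defaults if omitted.
output = replicate.run(
    model,
    input={
        "image": "https://replicate.delivery/pbxt/KYKAXfcjSZ7uEkwQoYG4SFXRJpbGcdKhWow4Ul0y4Qkjw6wW/training_pipelines.png",
        "prompt": "Describe each stage of this image",
        "max_new_tokens": 512,
    },
)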
Output Schema

Output

Type: array
Items Type: string
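
The output is an array of strings. Text-generating models on Replicate typically stream the result as chunks, so a common pattern is to concatenate them into the full text; continuing the sketch above:

# Join the streamed string chunks into the complete generated text.
text = "".join(output)
print(text)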

Version Details
Version ID
d1823e6f68cd3d57f2d315b9357dfa85f53817120ae0de8d2b95fbc8e93a1385
Version Created
March 11, 2024
Run on Replicate →