deepseek-ai/deepseek-vl-7b-base 🖼️📝🔢 → 📝
About
DeepSeek-VL: An open-source Vision-Language Model designed for real-world vision and language understanding applications

Example Output
Prompt:
"Describe each stage of this image"
Output:
The image depicts a three-stage process for training a vision-language model.
- Stage 1: Training VL Adapter: In this stage, a vision-language adapter is trained using supervised fine-tuning. The adapter is trained on image-text pairs and pure language sequences.
- Stage 2: Joint VL Pre-training: In this stage, a joint vision-language model is pre-trained using self-supervised learning. The model is trained on image-text pairs and pure language sequences.
- Stage 3: Supervised Fine-tuning: In this stage, the model is fine-tuned on supervised tasks using image-text pairs and pure language sequences.
The model is trained using a hybrid vision-language adapter, which combines a vision-language adapter with a language model. The model is trained on a variety of tasks, including image captioning, visual question answering, and visual reasoning. The model is able to understand the visual content of an image and generate a natural language description or answer.
Performance Metrics
- Prediction time: 8.43s
- Total time: 82.87s
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KYKAXfcjSZ7uEkwQoYG4SFXRJpbGcdKhWow4Ul0y4Qkjw6wW/training_pipelines.png",
  "prompt": "Describe each stage of this image",
  "max_new_tokens": 512
}
Input Parameters
- image (required): Input image
- prompt: Input prompt
- max_new_tokens: Maximum number of tokens to generate
Output Schema
Version Details
- Version ID: d1823e6f68cd3d57f2d315b9357dfa85f53817120ae0de8d2b95fbc8e93a1385
- Version created: March 11, 2024