cuuupid/qwen2-vl-2b 🖼️🔢📝 → 📝
About
State-of-the-art (SOTA) open-source model for chatting with videos, and the newest model in the Qwen family as of this version (August 31, 2024).

Example Output
Prompt:
"Describe the video in great detail."
Output:
The video features a woman standing behind a podium, speaking to the camera. She is wearing a blue shirt and appears to be giving a presentation or lecture. The woman's facial expression suggests that she is engaged and passionate about the topic she is discussing. The background of the video is not visible, but it can be assumed that it is an indoor setting, possibly a conference room or lecture hall. The woman's speech is not audible, but her body language and gestures suggest that she is using hand movements to emphasize certain points. Overall, the video seems to be a formal presentation or lecture, with the woman as the main speaker.
Performance Metrics
- Prediction Time: 16.38s
- Total Time: 16.39s
All Input Parameters
{
  "video": "https://replicate.delivery/pbxt/LXVISWYD8Od0I7w6EW5VIO3sycOIcukn6H26wrkaOX95RK7E/dod_classification_training.mp4",
  "width": 128,
  "height": 128,
  "prompt": "Describe the video in great detail.",
  "max_tokens": 512,
  "temperature": 0.7,
  "max_duration": 180
}
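The input above could be sent to the model with Replicate's Python client roughly as follows. This is a minimal sketch, assuming the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment. The model reference joins the model name and the version ID listed on this page; the helper name `describe_video` is hypothetical, and the handling of the return value is an assumption, since the output schema is not spelled out here.

```python
# Assumption: `pip install replicate` has been run and REPLICATE_API_TOKEN is set.
# Model name and version ID are copied from this page.
MODEL_REF = (
    "cuuupid/qwen2-vl-2b:"
    "b3e77005f19950db4f8564bbedd5670a3e1bd293f1fefd0088d10d8e4d083dd4"
)

# The example input shown on this page, unchanged.
EXAMPLE_INPUT = {
    "video": "https://replicate.delivery/pbxt/LXVISWYD8Od0I7w6EW5VIO3sycOIcukn6H26wrkaOX95RK7E/dod_classification_training.mp4",
    "width": 128,
    "height": 128,
    "prompt": "Describe the video in great detail.",
    "max_tokens": 512,
    "temperature": 0.7,
    "max_duration": 180,
}


def describe_video(model_input: dict) -> str:
    """Hypothetical helper: run the model and return its text output."""
    import replicate  # imported lazily so this module loads without the package

    output = replicate.run(MODEL_REF, input=model_input)
    # Assumption: the result is either a single string or an iterable of
    # text chunks; both cases are normalised to one string.
    if isinstance(output, str):
        return output
    return "".join(str(chunk) for chunk in output)


# Usage (needs network access and a valid API token):
# print(describe_video(EXAMPLE_INPUT))
```

Pinning the version ID in the reference keeps results tied to the version described on this page rather than whichever version is latest.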
Input Parameters
- video (required): Video to process
- width: Width for the video
- height: Height for the video
- prompt: Prompt to use for the video
- max_tokens: Maximum number of tokens to generate
- temperature: Temperature for the model (0.7 is a good default).
- max_duration: Maximum duration of the video in seconds (above 360, may run out of VRAM).
- repetition_penalty: Repetition penalty for the model (1.1 is a good default).
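The parameters listed above can be assembled into an input payload with a small helper. This is a sketch only: `build_input` is a hypothetical name, the checks are illustrative and are not the model's own validation, and the fallback values for `prompt`, `max_tokens`, and `max_duration` are borrowed from the example on this page rather than from the model's documented defaults. Only the 0.7 temperature, the 1.1 repetition penalty, and the 360-second VRAM caution come from the parameter descriptions.

```python
import warnings
from typing import Any, Dict, Optional


def build_input(
    video: str,
    prompt: str = "Describe the video in great detail.",  # from the page's example
    width: Optional[int] = None,
    height: Optional[int] = None,
    max_tokens: int = 512,            # from the page's example, not a documented default
    temperature: float = 0.7,         # "a good default" per the parameter list
    max_duration: int = 180,          # from the page's example, not a documented default
    repetition_penalty: float = 1.1,  # "a good default" per the parameter list
) -> Dict[str, Any]:
    """Hypothetical helper that builds the input dictionary for the model."""
    if not video:
        raise ValueError("video is required")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if max_duration > 360:
        # The parameter list cautions that durations above 360 s may exhaust VRAM.
        warnings.warn("max_duration above 360 seconds may run out of VRAM")

    payload: Dict[str, Any] = {
        "video": video,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "max_duration": max_duration,
        "repetition_penalty": repetition_penalty,
    }
    # width and height are sent only when the caller sets them explicitly.
    if width is not None:
        payload["width"] = width
    if height is not None:
        payload["height"] = height
    return payload
```

For instance, `build_input("https://example.com/clip.mp4", width=128, height=128)` (with a placeholder URL) yields a dictionary shaped like the example input, with `repetition_penalty` added.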
Output Schema
Output
Version Details
- Version ID: b3e77005f19950db4f8564bbedd5670a3e1bd293f1fefd0088d10d8e4d083dd4
- Version Created: August 31, 2024