lucataco/qwen3-vl-8b-instruct 🖼️🔢📝 → 📝
About
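An 8B-parameter instruction-tuned vision-language model in the Qwen3-VL series. It takes a text prompt plus an optional image or video and returns text.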
Example Output
Prompt:
"What is happening in this video?"
Output:
In this video, a monkey is seen riding a skateboard down a sidewalk in a park. The monkey is wearing roller skates on its feet and is pushing itself along the pavement. The camera follows the monkey as it moves forward, capturing its movements and the surrounding environment. People can be seen sitting on benches in the background, enjoying the park. The monkey appears to be enjoying its ride, and the video captures the fun and playful nature of the scene.
Performance Metrics
- Prediction Time: 5.52s
- Total Time: 87.89s
All Input Parameters
{
  "media": "https://replicate.delivery/pbxt/NvKfW9SS1seJAh8GneR5nzazMg6YxeviTwdTVxzibJ87IfRc/monkey_on_skates.mp4",
  "top_p": 0.9,
  "prompt": "What is happening in this video?",
  "video_fps": 1,
  "temperature": 0.7,
  "max_new_tokens": 512,
  "video_max_pixels": 200704
}
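The input above could be submitted programmatically. The following is a minimal sketch using only the Python standard library. It assumes Replicate's HTTP predictions endpoint (`https://api.replicate.com/v1/predictions`) with bearer-token authentication, so check the current API documentation before relying on it. The helper name `build_prediction_request` and the placeholder token are hypothetical. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request

# Version ID copied from the "Version Details" section of this page.
VERSION_ID = "39e893666996acf464cff75688ad49ac95ef54e9f1c688fbc677330acc478e11"

# Assumption: Replicate's HTTP endpoint for creating predictions.
API_URL = "https://api.replicate.com/v1/predictions"

# Same values as the "All Input Parameters" example.
EXAMPLE_INPUT = {
    "media": "https://replicate.delivery/pbxt/NvKfW9SS1seJAh8GneR5nzazMg6YxeviTwdTVxzibJ87IfRc/monkey_on_skates.mp4",
    "top_p": 0.9,
    "prompt": "What is happening in this video?",
    "video_fps": 1,
    "temperature": 0.7,
    "max_new_tokens": 512,
    "video_max_pixels": 200704,
}


def build_prediction_request(api_token, model_input, version=VERSION_ID):
    """Build (but do not send) a POST request for one prediction.

    Hypothetical helper for illustration; not part of any official client.
    """
    body = json.dumps({"version": version, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_prediction_request("YOUR_API_TOKEN", EXAMPLE_INPUT)

# To actually send it (requires a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.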
Input Parameters
- media
- Optional image or video file. Supported formats: images (jpg, png, etc.) and videos (mp4, avi, mov, etc.).
- top_p
- Cumulative probability for nucleus sampling.
- prompt
- Instruction or conversation turn for Qwen3-VL.
- video_fps
- Frames per second to sample from video. Only applies to video inputs. Lower values reduce token count.
- temperature
- Sampling temperature; set to 0 for deterministic output.
- max_new_tokens
- Maximum number of new tokens to sample.
- video_max_pixels
- Maximum pixels for video frames. Only applies to video inputs. Controls token count. Set to 0 for auto. Recommended: 50176-786432.
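The two video parameters above can be sanity-checked before a request is sent. The sketch below is illustrative only: `check_video_max_pixels` and `estimate_sampled_frames` are hypothetical helpers, the range bounds come from the recommendation in the parameter description, and the frame estimate assumes the sampler takes roughly `duration × video_fps` frames, which the model's actual preprocessing may round or cap differently.

```python
# Recommended bounds taken from the video_max_pixels description.
RECOMMENDED_MIN_PIXELS = 50176
RECOMMENDED_MAX_PIXELS = 786432


def check_video_max_pixels(video_max_pixels):
    """Return a warning string, or None if the value looks fine.

    0 means "auto" according to the parameter description.
    """
    if video_max_pixels == 0:
        return None
    if RECOMMENDED_MIN_PIXELS <= video_max_pixels <= RECOMMENDED_MAX_PIXELS:
        return None
    return (
        f"video_max_pixels={video_max_pixels} is outside the recommended "
        f"range {RECOMMENDED_MIN_PIXELS}-{RECOMMENDED_MAX_PIXELS}"
    )


def estimate_sampled_frames(duration_seconds, video_fps):
    """Rough frame count: duration times sampling rate, at least one frame.

    Assumption: the real sampler may round or limit frames differently.
    """
    return max(1, round(duration_seconds * video_fps))


# The example run on this page used video_fps=1 and video_max_pixels=200704.
warning = check_video_max_pixels(200704)
frames_for_ten_seconds = estimate_sampled_frames(10, 1)
```

Lowering `video_fps` reduces the number of frames sampled, and lowering `video_max_pixels` reduces the size of each frame; per the descriptions above, both reduce the token count.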
Output Schema
Output
Example Execution Logs
qwen-vl-utils using torchvision to read video.
Version Details
- Version ID
- 39e893666996acf464cff75688ad49ac95ef54e9f1c688fbc677330acc478e11
- Version Created
- October 21, 2025