lucataco/qwen3-vl-8b-instruct 🖼️🔢📝 → 📝

▶️ 54.4K runs 📅 Oct 2025 ⚙️ Cog 0.16.8 🔗 GitHub 📄 Paper ⚖️ License

image-analysis image-captioning image-to-text video-to-text visual-understanding

About

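Qwen3-VL-8B-Instruct is a vision-language model in the Qwen series. It takes an image or video together with a text prompt and returns a text response.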

Example Output

Prompt:

"What is happening in this video?"

Output:

In this video, a monkey is seen riding a skateboard down a sidewalk in a park. The monkey is wearing roller skates on its feet and is pushing itself along the pavement. The camera follows the monkey as it moves forward, capturing its movements and the surrounding environment. People can be seen sitting on benches in the background, enjoying the park. The monkey appears to be enjoying its ride, and the video captures the fun and playful nature of the scene.

Performance Metrics

5.52s Prediction Time

87.89s Total Time

All Input Parameters

{
  "media": "https://replicate.delivery/pbxt/NvKfW9SS1seJAh8GneR5nzazMg6YxeviTwdTVxzibJ87IfRc/monkey_on_skates.mp4",
  "top_p": 0.9,
  "prompt": "What is happening in this video?",
  "video_fps": 1,
  "temperature": 0.7,
  "max_new_tokens": 512,
  "video_max_pixels": 200704
}
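The parameters above can be sent from Python. The following is a minimal sketch, assuming the official `replicate` client package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The helper names `build_input` and `run_prediction` are illustrative and not part of the model or the client.

```python
MODEL = "lucataco/qwen3-vl-8b-instruct"
# Version ID as listed under "Version Details" on this page.
VERSION = "39e893666996acf464cff75688ad49ac95ef54e9f1c688fbc677330acc478e11"


def build_input(media_url, prompt="What is happening in this video?"):
    """Return an input payload mirroring the example parameters above."""
    return {
        "media": media_url,
        "top_p": 0.9,
        "prompt": prompt,
        "video_fps": 1,
        "temperature": 0.7,
        "max_new_tokens": 512,
        "video_max_pixels": 200704,
    }


def run_prediction(media_url, prompt="What is happening in this video?"):
    """Run the model once and return its text output (hypothetical helper)."""
    # Imported lazily so build_input can be used without the client installed.
    import replicate  # assumption: installed via `pip install replicate`

    output = replicate.run(
        f"{MODEL}:{VERSION}",
        input=build_input(media_url, prompt),
    )
    # The output schema is a string; joining covers the case where the
    # client hands back an iterator of text chunks instead.
    return output if isinstance(output, str) else "".join(output)
```

Calling `run_prediction` with the `media` URL from the example should return a description similar to the output shown earlier, although sampling with `temperature` 0.7 means the wording can differ between runs.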

Input Parameters

media (string): Optional image or video file. Supported formats: images (jpg, png, etc.) and videos (mp4, avi, mov, etc.).
top_p (number; default 0.9; range 0 to 1): Cumulative probability for nucleus sampling.
prompt (string; default "Describe what is happening in the media content"): Instruction or conversation turn for Qwen3-VL.
video_fps (number; default 1; range 0.1 to 10): Frames per second to sample from the video. Only applies to video inputs. Lower values reduce the token count.
temperature (number; default 0.7; range 0 to 2): Sampling temperature. Set to 0 for deterministic output.
max_new_tokens (integer; default 512; range 1 to 4096): Maximum number of new tokens to sample.
video_max_pixels (integer; default 200704; range 0 to 1048576): Maximum pixels per video frame. Only applies to video inputs. Controls the token count. Set to 0 for auto. Recommended: 50176 to 786432.
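The documented ranges can be checked on the client side before a request is sent. This is a sketch only: the `RANGES` table is copied from the list above, while `validate_input` and `estimate_sampled_frames` are hypothetical helpers. The frame estimate assumes sampling is simply duration times `video_fps`; the model's actual video loader may clamp or round the count differently.

```python
# Ranges copied from the parameter list above (inclusive bounds).
RANGES = {
    "top_p": (0, 1),
    "video_fps": (0.1, 10),
    "temperature": (0, 2),
    "max_new_tokens": (1, 4096),
    "video_max_pixels": (0, 1048576),
}
INTEGER_PARAMS = {"max_new_tokens", "video_max_pixels"}


def validate_input(params):
    """Return a list of error messages; an empty list means all checks passed."""
    errors = []
    for name, value in params.items():
        if name not in RANGES:
            continue  # media and prompt have no documented numeric range
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} must be a number, got {type(value).__name__}")
            continue
        if name in INTEGER_PARAMS and not isinstance(value, int):
            errors.append(f"{name} must be an integer, got {value!r}")
            continue
        low, high = RANGES[name]
        if not low <= value <= high:
            errors.append(f"{name}={value!r} is outside the range {low} to {high}")
    return errors


def estimate_sampled_frames(duration_seconds, video_fps=1.0):
    """Rough frame count for a clip, assuming plain duration * fps sampling."""
    return max(1, round(duration_seconds * video_fps))
```

For instance, `validate_input({"temperature": 3})` reports one error, and halving `video_fps` roughly halves the estimated frame count, which is the lever the parameter descriptions point to for reducing token usage on long videos.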

Output Schema

Output

Type: string

Example Execution Logs

qwen-vl-utils using torchvision to read video.

Version Details

Version ID: 39e893666996acf464cff75688ad49ac95ef54e9f1c688fbc677330acc478e11
Version Created: October 21, 2025
