lucataco/internvl3_5-30b 🔢🖼️📝✓ → 📝

▶️ 19.3K runs 📅 Oct 2025 ⚙️ Cog 0.16.8
image-analysis image-captioning image-to-text video-to-text visual-understanding

About

An advanced open-source multimodal large language model from the InternVL3.5 family, specializing in versatile vision-language tasks and enhanced reasoning.

Example Output

Prompt:

"Please describe this video in detail"

Output:

In a sunlit park with lush greenery and people in the background, a monkey is seen riding a miniature skateboard. The skateboard has a vibrant design with red wheels and a black deck. The monkey appears focused, skillfully balancing and propelling itself forward. The environment is lively yet relaxed, with people sitting under umbrellas and enjoying the park. The light is bright, indicating daytime, casting soft shadows on the pavement. The monkey's fur is a mix of light brown and beige, and it seems to be enjoying the ride, navigating the skateboard with apparent ease. The path is clear, surrounded by trees and benches, creating a serene atmosphere. The colors are vivid, with the green of the trees contrasting against the warm tones of the monkey and the colorful skateboard.

Performance Metrics

28.18s Prediction Time
28.19s Total Time
All Input Parameters
{
  "top_k": 50,
  "top_p": 0.95,
  "video": "https://replicate.delivery/pbxt/NviKBcXqpiFp2k6G0h8LALACiuE6vkw9T7oXb2ycpvQ6Bsma/monkey_on_skates.mp4",
  "prompt": "Please describe this video in detail",
  "num_beams": 1,
  "max_frames": 8,
  "temperature": 0.2,
  "thinking_mode": false,
  "max_new_tokens": 1024,
  "repetition_penalty": 1
}
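The input above could be submitted from Python roughly as follows. This is a minimal sketch, not an official client example: it assumes the `replicate` Python package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The helper names `build_video_input` and `describe_video` are hypothetical; the model name, version hash, and parameter values are copied from this page.

```python
from typing import Any, Dict

# Model name and version hash as listed on this page.
MODEL_REF = (
    "lucataco/internvl3_5-30b:"
    "92713bb56e2e4827477501fc512d80d7287bc34e4ccb7c306afd9bdaef8b7eeb"
)


def build_video_input(video_url: str, prompt: str) -> Dict[str, Any]:
    """Hypothetical helper: assemble the same input dict shown above."""
    return {
        "top_k": 50,
        "top_p": 0.95,
        "video": video_url,
        "prompt": prompt,
        "num_beams": 1,
        "max_frames": 8,
        "temperature": 0.2,
        "thinking_mode": False,
        "max_new_tokens": 1024,
        "repetition_penalty": 1,
    }


def describe_video(video_url: str, prompt: str) -> str:
    """Hypothetical helper: run one prediction and return the text output.

    Assumes the `replicate` package is installed and REPLICATE_API_TOKEN is set.
    """
    import replicate  # imported lazily so build_video_input works offline

    output = replicate.run(MODEL_REF, input=build_video_input(video_url, prompt))
    # The output schema is a string; joining covers clients that return chunks.
    return output if isinstance(output, str) else "".join(output)
```

Pinning the version hash keeps results tied to the version documented here; omitting it would target whatever version is current.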
Input Parameters
top_k | Type: integer | Default: 50 | Range: 1 - ∞
Top-k sampling parameter
top_p | Type: number | Default: 0.95 | Range: 0 - 1
Top-p sampling parameter
video | Type: string
Input video file for video understanding (alternative to images)
images | Type: string
Input image(s) - can be a single image or multiple images as a list (for multi-image understanding)
prompt | Type: string | Default: Please describe this image in detail.
Text prompt/question about the image(s) or video
num_beams | Type: integer | Default: 1 | Range: 1 - 5
Number of beams for beam search (1 = no beam search)
max_frames | Type: integer | Default: 8 | Range: 1 - 32
Maximum number of frames to extract from video (evenly spaced)
temperature | Type: number | Default: 0.2 | Range: 0 - 2
Temperature for sampling (higher = more random)
thinking_mode | Type: boolean | Default: false
Enable thinking mode for step-by-step reasoning
max_new_tokens | Type: integer | Default: 1024 | Range: 1 - 8192
Maximum number of tokens to generate
repetition_penalty | Type: number | Default: 1 | Range: 1 - 2
Repetition penalty (1.0 = no penalty)
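Because each numeric parameter has a documented range, a client may want to check values before sending a request. The sketch below is one way to do that; the `RANGES` table and `check_ranges` function are hypothetical names, and treating both bounds as inclusive is an assumption, since the page does not say.

```python
import math
from typing import Any, Dict, List, Tuple

# (min, max) bounds copied from the parameter list above.
# Assumption: both ends are inclusive.
RANGES: Dict[str, Tuple[float, float]] = {
    "top_k": (1, math.inf),
    "top_p": (0, 1),
    "num_beams": (1, 5),
    "max_frames": (1, 32),
    "temperature": (0, 2),
    "max_new_tokens": (1, 8192),
    "repetition_penalty": (1, 2),
}


def check_ranges(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    for name, value in params.items():
        if name not in RANGES:
            # video, images, prompt and thinking_mode have no numeric range.
            continue
        low, high = RANGES[name]
        if not (low <= value <= high):
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems
```

For example, `check_ranges({"num_beams": 6})` reports one problem, while the defaults listed above produce an empty list.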
Output Schema

Output

Type: string

Example Execution Logs
Processing video: /tmp/tmpqy2x4p3imonkey_on_skates.mp4
Extracted 8 frames from video
Processing 8 image(s)...
Generating response with max_new_tokens=1024, temperature=0.2...
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Generated response...
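The log above reports 8 frames extracted, matching `max_frames`, which is described as evenly spaced sampling. The model's actual sampling code is not shown on this page, so the following is only an illustration of one common way to pick evenly spaced frame indices; `evenly_spaced_indices` is a hypothetical function and the endpoint-inclusive spacing is an assumption.

```python
from typing import List


def evenly_spaced_indices(total_frames: int, max_frames: int) -> List[int]:
    """Pick up to max_frames indices spread across a clip of total_frames frames."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Spread indices from the first frame to the last frame inclusive.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```

With a 240-frame clip and `max_frames=8`, this yields 8 increasing indices starting at frame 0 and ending at frame 239.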
Version Details
Version ID
92713bb56e2e4827477501fc512d80d7287bc34e4ccb7c306afd9bdaef8b7eeb
Version Created
October 22, 2025