lucataco/internvl3_5-30b 🔢🖼️📝✓ → 📝

▶️ 23.2K runs 📅 Oct 2025 ⚙️ Cog 0.16.8 🔗 GitHub 📄 Paper ⚖️ License

image-analysis image-captioning image-to-text video-to-text visual-understanding

Performance

28.2s Typical run time

23.2K Total runs

About

An advanced open-source multimodal large language model from the InternVL3.5 family, specializing in versatile vision-language tasks and enhanced reasoning.

Example Output

Prompt:

"Please describe this video in detail"

Output

In a sunlit park with lush greenery and people in the background, a monkey is seen riding a miniature skateboard. The skateboard has a vibrant design with red wheels and a black deck. The monkey appears focused, skillfully balancing and propelling itself forward. The environment is lively yet relaxed, with people sitting under umbrellas and enjoying the park. The light is bright, indicating daytime, casting soft shadows on the pavement. The monkey's fur is a mix of light brown and beige, and it seems to be enjoying the ride, navigating the skateboard with apparent ease. The path is clear, surrounded by trees and benches, creating a serene atmosphere. The colors are vivid, with the green of the trees contrasting against the warm tones of the monkey and the colorful skateboard.

Performance Metrics

28.18s Prediction Time

28.19s Total Time

All Input Parameters

{
  "top_k": 50,
  "top_p": 0.95,
  "video": "https://replicate.delivery/pbxt/NviKBcXqpiFp2k6G0h8LALACiuE6vkw9T7oXb2ycpvQ6Bsma/monkey_on_skates.mp4",
  "prompt": "Please describe this video in detail",
  "num_beams": 1,
  "max_frames": 8,
  "temperature": 0.2,
  "thinking_mode": false,
  "max_new_tokens": 1024,
  "repetition_penalty": 1
}
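As a rough illustration of how these inputs could be submitted programmatically, the sketch below builds (but does not send) a request to Replicate's HTTP predictions endpoint using only the Python standard library. The version ID is the one listed under Version Details on this page; the helper name `build_prediction_request` and the placeholder token are assumptions made for this example, not part of the model's own code.

```python
import json
import urllib.request

# Version ID as listed under "Version Details" on this page.
VERSION = "92713bb56e2e4827477501fc512d80d7287bc34e4ccb7c306afd9bdaef8b7eeb"


def build_prediction_request(inputs: dict, token: str) -> urllib.request.Request:
    """Build, without sending, a POST request that would create a prediction.

    Hypothetical helper for illustration only.
    """
    body = json.dumps({"version": VERSION, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        "https://api.replicate.com/v1/predictions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# The same inputs as the example run above.
example_inputs = {
    "top_k": 50,
    "top_p": 0.95,
    "video": "https://replicate.delivery/pbxt/NviKBcXqpiFp2k6G0h8LALACiuE6vkw9T7oXb2ycpvQ6Bsma/monkey_on_skates.mp4",
    "prompt": "Please describe this video in detail",
    "num_beams": 1,
    "max_frames": 8,
    "temperature": 0.2,
    "thinking_mode": False,
    "max_new_tokens": 1024,
    "repetition_penalty": 1,
}

# "YOUR_API_TOKEN" is a placeholder; substitute a real token before sending.
request = build_prediction_request(example_inputs, token="YOUR_API_TOKEN")
```

Passing `request` to `urllib.request.urlopen` with a valid token would submit the prediction; the response then has to be polled or awaited to obtain the output string.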

Input Parameters

top_k (integer, default: 50, range: 1 - ∞): Top-k sampling parameter
top_p (number, default: 0.95, range: 0 - 1): Top-p sampling parameter
video (string): Input video file for video understanding (alternative to images)
images (string): Input image(s) - can be a single image or multiple images as a list (for multi-image understanding)
prompt (string, default: "Please describe this image in detail."): Text prompt/question about the image(s) or video
num_beams (integer, default: 1, range: 1 - 5): Number of beams for beam search (1 = no beam search)
max_frames (integer, default: 8, range: 1 - 32): Maximum number of frames to extract from video (evenly spaced)
temperature (number, default: 0.2, range: 0 - 2): Temperature for sampling (higher = more random)
thinking_mode (boolean, default: false): Enable thinking mode for step-by-step reasoning
max_new_tokens (integer, default: 1024, range: 1 - 8192): Maximum number of tokens to generate
repetition_penalty (number, default: 1, range: 1 - 2): Repetition penalty (1.0 = no penalty)
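The ranges listed above can be checked on the client side before a request is made. The following is a minimal sketch under that assumption: the `LIMITS` table simply transcribes the documented types and ranges, and `check_inputs` is a hypothetical helper, not something the model or Replicate provides.

```python
# (expected type, minimum, maximum) for each numeric parameter, transcribed
# from the list above. None means the range has no upper bound.
LIMITS = {
    "top_k": (int, 1, None),
    "top_p": (float, 0, 1),
    "num_beams": (int, 1, 5),
    "max_frames": (int, 1, 32),
    "temperature": (float, 0, 2),
    "max_new_tokens": (int, 1, 8192),
    "repetition_penalty": (float, 1, 2),
}


def check_inputs(inputs: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, value in inputs.items():
        if name not in LIMITS:
            # prompt, video, images and thinking_mode have no numeric range.
            continue
        kind, low, high = LIMITS[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number")
            continue
        if kind is int and not isinstance(value, int):
            problems.append(f"{name}: expected an integer")
            continue
        if value < low or (high is not None and value > high):
            upper = "inf" if high is None else high
            problems.append(f"{name}: {value} is outside {low} - {upper}")
    return problems
```

For instance, `check_inputs({"max_frames": 64})` reports one problem because the documented maximum is 32, while the defaults pass without any.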

Output Schema

Output

Type: string

Example Execution Logs

Processing video: /tmp/tmpqy2x4p3imonkey_on_skates.mp4
Extracted 8 frames from video
Processing 8 image(s)...
Generating response with max_new_tokens=1024, temperature=0.2...
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Generated response...
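The log shows 8 frames being extracted, matching `max_frames`, and the parameter description says the frames are evenly spaced. The model's actual sampling code is not shown on this page, so the sketch below is only one common way to pick evenly spaced frame indices (the midpoint of each equal-length segment); the function name and the midpoint choice are assumptions.

```python
def evenly_spaced_indices(total_frames: int, max_frames: int = 8) -> list:
    """Pick up to max_frames frame indices spread evenly across a video.

    Illustrative only: splits the video into equal segments and takes the
    frame at the midpoint of each segment.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    # Never request more frames than the video contains.
    count = min(total_frames, max_frames)
    return [int((i + 0.5) * total_frames / count) for i in range(count)]
```

With this scheme a short clip that has fewer frames than `max_frames` simply yields every frame once.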

Version Details

Version ID: 92713bb56e2e4827477501fc512d80d7287bc34e4ccb7c306afd9bdaef8b7eeb
Version Created: October 22, 2025
