nvidia/nemotron-nano-v2-12b-vl 🔢🖼️📝 → 📝

▶️ 854 runs 📅 Oct 2025 ⚙️ Cog 0.16.8

image-to-text ocr video-to-text

About

A multi-modal AI model for visual Q&A, summarization, and data extraction, supporting text, images, and video.

Example Output

Prompt:

"Describe what you see in this video"

Output:

This video demonstrates a 3D modeling software interface. The video shows a 3D model of a cabin surrounded by trees and snow. The cabin is brown with a slanted roof covered in snow. The surrounding area is covered in snow, with trees and logs scattered around. The video shows the model from different angles, highlighting the details of the cabin and the surrounding environment. The software interface includes various tools and settings, such as the timeline, properties panel, and viewport. The video also shows the model being rotated and zoomed in and out.

Performance Metrics

51.95s Prediction Time

197.94s Total Time

All Input Parameters

{
  "top_p": 1,
  "video": "https://replicate.delivery/pbxt/Nxs0zAWevuwuqqQKhG7APDGLTXGQKIOVleq3buIYoQJn78cv/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4",
  "prompt": "Describe what you see in this video",
  "video_fps": 1,
  "temperature": 0,
  "system_prompt": "/no_think",
  "repetition_penalty": 1,
  "video_pruning_rate": 0.75
}
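
The parameters above can be sent to the model from code. The following is a minimal sketch, assuming the official `replicate` Python client is installed and `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_video_input` and `describe_video` are hypothetical and exist only for illustration; the model reference is assembled from the model name and the Version ID listed on this page.

```python
import os

# Model reference assembled from this page: owner/name plus the Version ID
# listed under Version Details.
MODEL_REF = (
    "nvidia/nemotron-nano-v2-12b-vl:"
    "f455944681df5f1bfb4171a7e1100983bd0e788d40cb75f5ae799921732d0d19"
)


def build_video_input(video_url, prompt="Describe what you see in this video"):
    """Return the example's input parameters for a given video URL (hypothetical helper)."""
    return {
        "top_p": 1,
        "video": video_url,
        "prompt": prompt,
        "video_fps": 1,
        "temperature": 0,
        "system_prompt": "/no_think",
        "repetition_penalty": 1,
        "video_pruning_rate": 0.75,
    }


def describe_video(video_url):
    """Run the model remotely; needs `pip install replicate` and an API token."""
    import replicate  # imported lazily so build_video_input works offline

    if not os.environ.get("REPLICATE_API_TOKEN"):
        raise RuntimeError("Set REPLICATE_API_TOKEN before calling the API")
    output = replicate.run(MODEL_REF, input=build_video_input(video_url))
    # The output schema is a string; join chunks in case the client yields pieces.
    return output if isinstance(output, str) else "".join(str(part) for part in output)
```

Calling `describe_video(...)` with the video URL from the example should return a description similar to the one shown, though the exact return type can vary with the client version, which is why the sketch normalizes it to a single string.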

Input Parameters

top_p (number, default: 1, range: 0 - 1): Nucleus sampling top-p
video (string): Input video file (provide either images or video, not both)
images (array): List of input images (1-4 images supported)
prompt (string, default: "Describe what you see in detail."): Text prompt or question about the media
video_fps (integer, default: 1, range: 1 - 30): Frames per second to extract from video (only used for video input)
temperature (number, default: 0, range: 0 - 2): Sampling temperature (0 for greedy decoding)
system_prompt (string, default: "/no_think"): System prompt (/no_think disables chain-of-thought reasoning)
max_new_tokens (integer, range: 1 - 2048): Maximum number of tokens to generate (default: 1024 for images, 128 for videos)
repetition_penalty (number, default: 1, range: 1 - 2): Repetition penalty to reduce repetitive text (1.0 = no penalty)
video_pruning_rate (number, default: 0.75, range: 0 - 1): Video pruning rate for efficiency (0.0 = no pruning, 1.0 = max pruning; only used for video)
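
The ranges and the images-or-video rule listed above can be checked on the client side before a request is sent. This is a minimal sketch, not part of the model's own code: `RANGES` and `validate_input` are hypothetical names, and the bounds are copied from the parameter list on this page and treated as inclusive, which is an assumption.

```python
# Ranges copied from the parameter list above: (min, max), assumed inclusive.
RANGES = {
    "top_p": (0, 1),
    "video_fps": (1, 30),
    "temperature": (0, 2),
    "max_new_tokens": (1, 2048),
    "repetition_penalty": (1, 2),
    "video_pruning_rate": (0, 1),
}


def validate_input(params):
    """Return a list of problems found in a candidate input dict (empty if none)."""
    problems = []
    has_images = bool(params.get("images"))
    has_video = bool(params.get("video"))
    # The page states that images and video are mutually exclusive.
    if has_images and has_video:
        problems.append("provide either images or video, not both")
    # The page states that 1-4 images are supported.
    if has_images and not 1 <= len(params["images"]) <= 4:
        problems.append("images must contain 1-4 items")
    # Check every numeric parameter that is present against its listed range.
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name} must be in [{low}, {high}]")
    return problems
```

For example, `validate_input({"video": "v.mp4", "images": ["a.png"]})` reports the either-or violation, while the example input shown earlier on this page passes with an empty list.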

Output Schema

Output

Type: string

Example Execution Logs

Extracted 11 frames from video at 1 fps
pruning rate: 0.75, EVS mask: 704 tokens retained out of 2816 total video tokens (25.00%)
Setting `pad_token_id` to `eos_token_id`:12 for open-end generation.
NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.
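
The token counts in the log line above are consistent with a simple relation, retained = total × (1 − pruning rate). The sketch below reproduces that arithmetic; it is an assumption inferred from this single log, not the model's actual EVS implementation, and the per-frame token count (2816 / 11 = 256) may well depend on the input resolution. The function name `retained_video_tokens` is hypothetical.

```python
def retained_video_tokens(total_tokens, pruning_rate):
    """Tokens kept after pruning, assuming kept = total * (1 - rate)."""
    return round(total_tokens * (1 - pruning_rate))


frames = 11    # "Extracted 11 frames from video at 1 fps"
total = 2816   # total video tokens reported in the log
tokens_per_frame = total // frames  # inferred from this one log only
kept = retained_video_tokens(total, 0.75)

print(tokens_per_frame, kept, f"{kept / total:.2%}")
# prints: 256 704 25.00%
```

Under the same assumption, raising `video_fps` increases the total token count roughly in proportion to the number of extracted frames, and `video_pruning_rate` controls what fraction of those tokens reaches the language model.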

Version Details

Version ID: f455944681df5f1bfb4171a7e1100983bd0e788d40cb75f5ae799921732d0d19
Version Created: October 26, 2025
