nvidia/nemotron-nano-v2-12b-vl 🔢🖼️📝 → 📝

▶️ 824 runs 📅 Oct 2025 ⚙️ Cog 0.16.8
image-to-text ocr video-to-text

About

A multimodal vision-language model for visual question answering, summarization, and data extraction. It takes a text prompt together with either images or a video and returns text.

Example Output

Prompt:

"Describe what you see in this video"

Output:

This video demonstrates a 3D modeling software interface. The video shows a 3D model of a cabin surrounded by trees and snow. The cabin is brown with a slanted roof covered in snow. The surrounding area is covered in snow, with trees and logs scattered around. The video shows the model from different angles, highlighting the details of the cabin and the surrounding environment. The software interface includes various tools and settings, such as the timeline, properties panel, and viewport. The video also shows the model being rotated and zoomed in and out.

Performance Metrics

51.95s Prediction Time
197.94s Total Time
All Input Parameters
{
  "top_p": 1,
  "video": "https://replicate.delivery/pbxt/Nxs0zAWevuwuqqQKhG7APDGLTXGQKIOVleq3buIYoQJn78cv/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4",
  "prompt": "Describe what you see in this video",
  "video_fps": 1,
  "temperature": 0,
  "system_prompt": "/no_think",
  "repetition_penalty": 1,
  "video_pruning_rate": 0.75
}
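
The inputs above could be sent from Python roughly as follows. This is a sketch, not the page's own example: it assumes the third-party `replicate` client package is installed and that `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `run_prediction` are hypothetical, and pinning the version ID listed under Version Details is a choice made for this example.

```python
MODEL = "nvidia/nemotron-nano-v2-12b-vl"
# Version ID copied from the Version Details section of this page.
VERSION = "f455944681df5f1bfb4171a7e1100983bd0e788d40cb75f5ae799921732d0d19"


def build_input(video_url: str, prompt: str) -> dict:
    """Return the same input parameters as the example prediction above."""
    return {
        "top_p": 1,
        "video": video_url,
        "prompt": prompt,
        "video_fps": 1,
        "temperature": 0,
        "system_prompt": "/no_think",
        "repetition_penalty": 1,
        "video_pruning_rate": 0.75,
    }


def run_prediction(payload: dict) -> str:
    """Send the payload to Replicate (hypothetical helper, needs network access)."""
    import replicate  # assumption: `pip install replicate`; imported lazily

    output = replicate.run(f"{MODEL}:{VERSION}", input=payload)
    # The output schema is a string; joining is a fallback in case the
    # client yields the text in chunks.
    return output if isinstance(output, str) else "".join(str(c) for c in output)
```

Calling `run_prediction(build_input(video_url, "Describe what you see in this video"))` would then be expected to return a description like the example output, though wording can differ between runs and versions.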
Input Parameters
top_p Type: number, Default: 1, Range: 0 - 1
Nucleus sampling top-p
video Type: string
Input video file (provide either images or video, not both)
images Type: array
List of input images (1-4 images supported)
prompt Type: string, Default: Describe what you see in detail.
Text prompt or question about the media
video_fps Type: integer, Default: 1, Range: 1 - 30
Frames per second to extract from video (only used for video input)
temperature Type: number, Default: 0, Range: 0 - 2
Sampling temperature (0 for greedy decoding)
system_prompt Type: string, Default: /no_think
System prompt (/no_think disables chain-of-thought reasoning)
max_new_tokens Type: integer, Range: 1 - 2048
Maximum number of tokens to generate (default: 1024 for images, 128 for videos)
repetition_penalty Type: number, Default: 1, Range: 1 - 2
Repetition penalty to reduce repetitive text (1.0 = no penalty)
video_pruning_rate Type: number, Default: 0.75, Range: 0 - 1
Video pruning rate for efficiency (0.0=no pruning, 1.0=max pruning, only used for video)
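
The constraints in this parameter list can be checked on the client before a request is sent. The sketch below is one possible way to do that. The function names `validate_input` and `effective_max_new_tokens` are hypothetical, the ranges are assumed to be inclusive, and the fallback for a request with neither images nor video is a guess, since the page does not document that case.

```python
from typing import Any

# (min, max) ranges copied from the parameter list; assumed inclusive.
RANGES = {
    "top_p": (0, 1),
    "video_fps": (1, 30),
    "temperature": (0, 2),
    "max_new_tokens": (1, 2048),
    "repetition_penalty": (1, 2),
    "video_pruning_rate": (0, 1),
}


def validate_input(params: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    images = params.get("images") or []
    if params.get("video") and images:
        errors.append("provide either images or video, not both")
    if len(images) > 4:
        errors.append("images: 1-4 images supported")
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            errors.append(f"{name}: must be between {low} and {high}")
    return errors


def effective_max_new_tokens(params: dict[str, Any]) -> int:
    """Resolve the documented media-dependent default for max_new_tokens."""
    if "max_new_tokens" in params:
        return params["max_new_tokens"]
    # 128 for videos, 1024 for images; using 1024 when no media is given
    # is an assumption.
    return 128 if params.get("video") else 1024
```

Because the default for video is only 128 tokens, a long video description may need `max_new_tokens` set explicitly.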
Output Schema

Output

Type: string

Example Execution Logs
Extracted 11 frames from video at 1 fps
pruning rate: 0.75, EVS mask: 704 tokens retained out of 2816 total video tokens (25.00%)
Setting `pad_token_id` to `eos_token_id`:12 for open-end generation.
NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.
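
The token counts in the pruning log line follow from simple arithmetic, which can be reproduced as a sanity check. This sketch only models the count of retained tokens. The rounding rule is an assumption that happens to match this one log line, `evs_retained_tokens` is a hypothetical name, and how EVS decides which tokens to keep is not shown on this page.

```python
def evs_retained_tokens(total_tokens: int, pruning_rate: float) -> int:
    """Estimate how many video tokens survive pruning at the given rate."""
    # Assumption: the retained count is total * (1 - rate), rounded.
    return round(total_tokens * (1.0 - pruning_rate))


frames = 11    # "Extracted 11 frames from video at 1 fps"
total = 2816   # total video tokens reported in the log
retained = evs_retained_tokens(total, 0.75)
tokens_per_frame = total // frames  # derived from this log only

print(retained, tokens_per_frame, f"{retained / total:.2%}")
# prints: 704 256 25.00%
```

For this run, each extracted frame contributed 256 tokens before pruning, so raising `video_fps` or lowering `video_pruning_rate` increases the number of tokens the model has to process.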
Version Details
Version ID
f455944681df5f1bfb4171a7e1100983bd0e788d40cb75f5ae799921732d0d19
Version Created
October 26, 2025