nvidia/nemotron-nano-v2-12b-vl 🔢🖼️📝 → 📝
About
A multi-modal AI model for visual Q&A, summarization, and data extraction, supporting text, images, and video.
Example Output
Prompt:
"Describe what you see in this video"
Output:
This video demonstrates a 3D modeling software interface. The video shows a 3D model of a cabin surrounded by trees and snow. The cabin is brown with a slanted roof covered in snow. The surrounding area is covered in snow, with trees and logs scattered around. The video shows the model from different angles, highlighting the details of the cabin and the surrounding environment. The software interface includes various tools and settings, such as the timeline, properties panel, and viewport. The video also shows the model being rotated and zoomed in and out.
Performance Metrics
51.95s
Prediction Time
197.94s
Total Time
All Input Parameters
{
  "top_p": 1,
  "video": "https://replicate.delivery/pbxt/Nxs0zAWevuwuqqQKhG7APDGLTXGQKIOVleq3buIYoQJn78cv/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4",
  "prompt": "Describe what you see in this video",
  "video_fps": 1,
  "temperature": 0,
  "system_prompt": "/no_think",
  "repetition_penalty": 1,
  "video_pruning_rate": 0.75
}
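A minimal sketch of how the parameters above might be sent with the `replicate` Python client. It assumes the `replicate` package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The helper name `run_example` is hypothetical, and the shape of the return value (a string or a sequence of text chunks) is not confirmed by this page.

```python
# Model reference: owner/name plus the version ID listed on this page.
MODEL_REF = (
    "nvidia/nemotron-nano-v2-12b-vl:"
    "f455944681df5f1bfb4171a7e1100983bd0e788d40cb75f5ae799921732d0d19"
)

# The exact input parameters from the example run.
EXAMPLE_INPUT = {
    "top_p": 1,
    "video": "https://replicate.delivery/pbxt/Nxs0zAWevuwuqqQKhG7APDGLTXGQKIOVleq3buIYoQJn78cv/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4",
    "prompt": "Describe what you see in this video",
    "video_fps": 1,
    "temperature": 0,
    "system_prompt": "/no_think",
    "repetition_penalty": 1,
    "video_pruning_rate": 0.75,
}


def run_example(model_input=EXAMPLE_INPUT):
    """Hypothetical helper: submit the example input and return the raw result."""
    # Imported lazily so this file can be loaded without the client installed.
    import replicate  # third-party; reads REPLICATE_API_TOKEN from the environment

    return replicate.run(MODEL_REF, input=model_input)
```

Calling `run_example()` performs a network request and is billed as a normal prediction, so it is left uninvoked here.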
Input Parameters
- top_p
- Nucleus sampling top-p
- video
- Input video file (provide either images or video, not both)
- images
- List of input images (1-4 images supported)
- prompt
- Text prompt or question about the media
- video_fps
- Frames per second to extract from video (only used for video input)
- temperature
- Sampling temperature (0 for greedy decoding)
- system_prompt
- System prompt (/no_think disables chain-of-thought reasoning)
- max_new_tokens
- Maximum number of tokens to generate (default: 1024 for images, 128 for videos)
- repetition_penalty
- Repetition penalty to reduce repetitive text (1.0 = no penalty)
- video_pruning_rate
- Video pruning rate for efficiency (0.0=no pruning, 1.0=max pruning, only used for video)
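The constraints in the parameter list can be checked on the client before a request is made. The sketch below is an assumption about how such a check might look, not the model's own validation: `build_input` is a hypothetical helper, and dropping video-only options for image requests is a choice made here because the descriptions say those options are only used for video.

```python
VIDEO_ONLY_OPTIONS = {"video_fps", "video_pruning_rate"}


def build_input(prompt, *, video=None, images=None, **options):
    """Hypothetical helper: assemble an input dict that respects the documented limits."""
    if video is not None and images is not None:
        raise ValueError("provide either images or video, not both")

    payload = {"prompt": prompt}

    if images is not None:
        if not 1 <= len(images) <= 4:
            raise ValueError("1-4 images supported")
        payload["images"] = list(images)
        # Video-only options have no effect on image input, so leave them out.
        options = {k: v for k, v in options.items() if k not in VIDEO_ONLY_OPTIONS}

    if video is not None:
        rate = options.get("video_pruning_rate")
        if rate is not None and not 0.0 <= rate <= 1.0:
            raise ValueError("video_pruning_rate must be between 0.0 and 1.0")
        payload["video"] = video

    payload.update(options)
    return payload
```

For example, `build_input("Describe the scene", images=["a.png"], video_fps=1)` returns a dict with `prompt` and `images` only, while passing both `video` and `images` raises `ValueError`.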
Output Schema
Output
Example Execution Logs
Extracted 11 frames from video at 1 fps
pruning rate: 0.75, EVS mask: 704 tokens retained out of 2816 total video tokens (25.00%)
Setting `pad_token_id` to `eos_token_id`:12 for open-end generation.
NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.
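The numbers in the log are consistent with a simple reading of `video_pruning_rate`: the fraction of video tokens dropped. The sketch below reproduces that arithmetic. The function name `retained_tokens` and the use of `round` are assumptions, since the model's actual pruning code may round or select tokens differently.

```python
def retained_tokens(total_tokens, pruning_rate):
    """Tokens kept if `pruning_rate` is the fraction of video tokens dropped."""
    return round(total_tokens * (1 - pruning_rate))


frames = 11      # "Extracted 11 frames from video at 1 fps"
total = 2816     # "2816 total video tokens"
kept = retained_tokens(total, 0.75)

print(kept)                    # 704
print(f"{kept / total:.2%}")   # 25.00%
print(total // frames)         # 256 tokens per frame before pruning
```

Under this reading, lowering `video_fps` reduces the number of frames and therefore the total token count, while raising `video_pruning_rate` reduces the share of those tokens that is kept.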
Version Details
- Version ID
- f455944681df5f1bfb4171a7e1100983bd0e788d40cb75f5ae799921732d0d19
- Version Created
- October 26, 2025