lucataco/interactiveomni-8b
About
A unified omni-modal model that accepts image, audio, text, and video inputs simultaneously and directly generates coherent text and speech.
Example Output
{"text":"A samurai in traditional armor sits contemplatively beneath a blooming cherry blossom tree, surrounded by a serene blue palette.","audio":"https://replicate.delivery/xezq/lUbB1TmpCLICNpMebjQ7nnyIfuHiFRbbaXUkjlpFrvPeHZNrA/omni_response.wav"}
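The example output is a JSON object with a `text` field and an `audio` URL pointing to the generated speech. A minimal Python sketch for consuming it, assuming the response is already available as a JSON string; the helper name `parse_omni_output` is hypothetical and not part of any Replicate API, and the handling of a missing `audio` field is an assumption:

```python
import json
from pathlib import PurePosixPath
from typing import Optional, Tuple
from urllib.parse import urlparse


def parse_omni_output(raw: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (text, audio_url, audio_filename) from the model's JSON output."""
    data = json.loads(raw)
    text = data["text"]
    # Assumption: "audio" may be absent or null when audio output is disabled.
    audio_url = data.get("audio")
    filename = PurePosixPath(urlparse(audio_url).path).name if audio_url else None
    return text, audio_url, filename


# The example output from this page, split across literals for readability.
example = (
    '{"text":"A samurai in traditional armor sits contemplatively beneath a '
    'blooming cherry blossom tree, surrounded by a serene blue palette.",'
    '"audio":"https://replicate.delivery/xezq/'
    'lUbB1TmpCLICNpMebjQ7nnyIfuHiFRbbaXUkjlpFrvPeHZNrA/omni_response.wav"}'
)

text, audio_url, filename = parse_omni_output(example)
print(text)
print(filename)
```

The filename is derived from the URL path only, so it can be used as a local save name once the audio is downloaded.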
Performance Metrics
- Prediction Time: 3.66s
- Total Time: 3.67s
All Input Parameters
{
  "seed": 0,
  "audio": "https://replicate.delivery/pbxt/O173PpOSmN02fFtPQJGQq89HY2KxOjOU0ELOb6wsFBeDe5A8/describe-image.mp3",
  "top_p": 0.8,
  "images": [
    "https://replicate.delivery/pbxt/O173Pe49Q0tauhwGdNnP9psjJ6hMIYpAErSc0QFGvOOZXmBj/G5Fr3nYXwAAsojw.jpeg"
  ],
  "max_tiles": 12,
  "temperature": 0.7,
  "system_prompt": "",
  "max_new_tokens": 512,
  "frames_per_tile": 8,
  "enable_audio_output": true
}
Input Parameters
- seed: Random seed for reproducible sampling. Set to -1 to disable seeding.
- audio: Optional audio clip (WAV/MP3/FLAC). Resampled to 24 kHz automatically.
- top_p: Top-p nucleus sampling parameter.
- video: Optional video clip (MP4/MOV/WebM) for video-grounded conversation.
- images: Optional list of images (PNG/JPG/WebP) to provide visual context.
- prompt: User text prompt. Leave blank when providing only media inputs.
- max_tiles: Maximum number of temporal tiles to sample when a video is provided.
- temperature: Sampling temperature. Set to 0 for greedy decoding.
- system_prompt: Optional system prompt. When audio output is enabled and this is left blank, a recommended prompt is injected automatically.
- max_new_tokens: Maximum number of tokens to generate.
- frames_per_tile: Number of frames sampled per tile when processing video.
- enable_audio_output: Return generated speech along with text output.
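The parameters above can be assembled into a request as in the following Python sketch. It assumes the official `replicate` client package is installed and `REPLICATE_API_TOKEN` is set. The helper names `build_input` and `run_prediction` are hypothetical. The default values are copied from the example run on this page, not from the model's schema, and the check that at least one of prompt or media is supplied is an assumption rather than documented behavior.

```python
MODEL_VERSION = (
    "lucataco/interactiveomni-8b:"
    "6d19412f763aa2b82d67174fcbb02ca3385740dae88d9ccbded1997f59e75d2f"
)


def build_input(prompt="", images=None, audio=None, video=None, **overrides):
    """Assemble an input dict, defaulting to the values of the example run."""
    # Assumption: the model needs a prompt or at least one media input.
    if not (prompt or images or audio or video):
        raise ValueError("provide a prompt or at least one media input")
    payload = {
        "seed": 0,
        "top_p": 0.8,
        "max_tiles": 12,
        "temperature": 0.7,
        "system_prompt": "",
        "max_new_tokens": 512,
        "frames_per_tile": 8,
        "enable_audio_output": True,
    }
    payload.update(overrides)
    # Only include optional inputs that were actually supplied.
    if prompt:
        payload["prompt"] = prompt
    if images:
        payload["images"] = list(images)
    if audio:
        payload["audio"] = audio
    if video:
        payload["video"] = video
    return payload


def run_prediction(payload):
    """Submit the payload to Replicate; needs network access and an API token."""
    import replicate  # imported lazily so build_input works without the package

    return replicate.run(MODEL_VERSION, input=payload)
```

For instance, `run_prediction(build_input(images=[image_url], audio=audio_url))` would mirror the example run, and passing `temperature=0` as an override selects greedy decoding as described above.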
Output Schema
Example Execution Logs
2025-11-06 21:45:31,604 - INFO - audio num: 1
2025-11-06 21:45:31,605 - INFO - audio_token_num: 69
2025-11-06 21:45:31,605 - INFO - image num: 1
2025-11-06 21:45:31,605 - INFO - dynamic ViT batch size: 13
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
2025-11-06 21:45:32,437 - INFO - get fill token, need to append more text token
2025-11-06 21:45:32,437 - INFO - append 5 text token
2025-11-06 21:45:32,690 - INFO - fill_token index 25 next fill_token index 51
2025-11-06 21:45:32,690 - INFO - get fill token, need to append more text token
2025-11-06 21:45:32,690 - INFO - append 5 text token
2025-11-06 21:45:32,938 - INFO - fill_token index 51 next fill_token index 77
2025-11-06 21:45:32,938 - INFO - get fill token, need to append more text token
2025-11-06 21:45:32,938 - INFO - append 5 text token
2025-11-06 21:45:33,185 - INFO - fill_token index 77 next fill_token index 103
2025-11-06 21:45:33,185 - INFO - get fill token, need to append more text token
2025-11-06 21:45:33,185 - INFO - append 5 text token
2025-11-06 21:45:33,433 - INFO - fill_token index 103 next fill_token index 129
2025-11-06 21:45:33,433 - INFO - get fill token, need to append more text token
2025-11-06 21:45:33,433 - INFO - append 5 text token
2025-11-06 21:45:33,683 - INFO - fill_token index 129 next fill_token index 155
2025-11-06 21:45:33,683 - INFO - no more text token, decode until met eos
2025-11-06 21:45:34,875 - INFO - query: "<|im_start|>system\nYou are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech.<|im_end|>\n<|im_start|>user\n<image>\n<audio>\n<|im_end|>\n<|im_start|>assistant\n"
2025-11-06 21:45:34,875 - INFO - response: A samurai in traditional armor sits contemplatively beneath a blooming cherry blossom tree, surrounded by a serene blue palette.
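The log entries carry timestamps of the form `YYYY-MM-DD HH:MM:SS,mmm`, and the `fill_token index` entries report pairs of positions. As a rough Python sketch, these can be pulled out with regular expressions to measure the time between the first and last entries and the spacing between fill-token positions. The function names and the shortened `LOG_EXCERPT` are illustrative, and the code only reports what the log states without interpreting what a fill token is.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")
FILL_PAIR = re.compile(r"fill_token index (\d+) next fill_token index (\d+)")

# Shortened excerpt of the execution log shown above.
LOG_EXCERPT = """\
2025-11-06 21:45:31,604 - INFO - audio num: 1
2025-11-06 21:45:32,690 - INFO - fill_token index 25 next fill_token index 51
2025-11-06 21:45:32,938 - INFO - fill_token index 51 next fill_token index 77
2025-11-06 21:45:33,683 - INFO - no more text token, decode until met eos
2025-11-06 21:45:34,875 - INFO - response: A samurai in traditional armor ...
"""


def elapsed_seconds(log_text):
    """Seconds between the first and last timestamp found in the log text."""
    stamps = [
        datetime.strptime(match, "%Y-%m-%d %H:%M:%S,%f")
        for match in TIMESTAMP.findall(log_text)
    ]
    if len(stamps) < 2:
        return 0.0
    return (stamps[-1] - stamps[0]).total_seconds()


def fill_token_strides(log_text):
    """Distance between each logged fill_token index and the next one."""
    return [int(nxt) - int(cur) for cur, nxt in FILL_PAIR.findall(log_text)]


print(elapsed_seconds(LOG_EXCERPT))
print(fill_token_strides(LOG_EXCERPT))
```

Applied to the full log, the same functions would cover every timestamped entry; lines without a timestamp, such as the `pad_token_id` notice, are simply skipped.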
Version Details
- Version ID: 6d19412f763aa2b82d67174fcbb02ca3385740dae88d9ccbded1997f59e75d2f
- Version Created: November 6, 2025