lucataco/interactiveomni-8b

78 runs · Nov 2025 · Cog 0.16.8

image-to-text text-to-speech video-to-text

About

A unified omni-modal model that accepts image, audio, text, and video inputs simultaneously and directly generates coherent text and speech.

Example Output

{"text":"A samurai in traditional armor sits contemplatively beneath a blooming cherry blossom tree, surrounded by a serene blue palette.","audio":"https://replicate.delivery/xezq/lUbB1TmpCLICNpMebjQ7nnyIfuHiFRbbaXUkjlpFrvPeHZNrA/omni_response.wav"}
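The output is a JSON object with a `text` field and an `audio` URL pointing at the generated WAV file. A minimal sketch of reading it with the Python standard library follows. The helper name `parse_output` is hypothetical, and treating `audio` as optional is an assumption based on the `enable_audio_output` flag, not something the page states.

```python
import json
from typing import Optional, Tuple


def parse_output(raw: str) -> Tuple[str, Optional[str]]:
    """Return (text, audio_url) from the model's JSON output."""
    data = json.loads(raw)
    # Assumption: "audio" may be missing or null when speech output is disabled.
    return data["text"], data.get("audio")


# The example output shown above, as a raw JSON string.
raw = (
    '{"text":"A samurai in traditional armor sits contemplatively beneath a '
    'blooming cherry blossom tree, surrounded by a serene blue palette.",'
    '"audio":"https://replicate.delivery/xezq/'
    'lUbB1TmpCLICNpMebjQ7nnyIfuHiFRbbaXUkjlpFrvPeHZNrA/omni_response.wav"}'
)
text, audio_url = parse_output(raw)
```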

Performance Metrics

3.66s Prediction Time

3.67s Total Time

All Input Parameters

{
  "seed": 0,
  "audio": "https://replicate.delivery/pbxt/O173PpOSmN02fFtPQJGQq89HY2KxOjOU0ELOb6wsFBeDe5A8/describe-image.mp3",
  "top_p": 0.8,
  "images": [
    "https://replicate.delivery/pbxt/O173Pe49Q0tauhwGdNnP9psjJ6hMIYpAErSc0QFGvOOZXmBj/G5Fr3nYXwAAsojw.jpeg"
  ],
  "max_tiles": 12,
  "temperature": 0.7,
  "system_prompt": "",
  "max_new_tokens": 512,
  "frames_per_tile": 8,
  "enable_audio_output": true
}

Input Parameters

seed (integer, default: 0): Random seed for reproducible sampling. Set to -1 to disable seeding.
audio (string): Optional audio clip (WAV/MP3/FLAC). Resampled to 24 kHz automatically.
top_p (number, default: 0.8, range: 0.01 to 1): Top-p (nucleus) sampling parameter.
video (string): Optional video clip (MP4/MOV/WebM) for video-grounded conversation.
images (array): Optional list of images (PNG/JPG/WebP) to provide visual context.
prompt (string, default: empty): User text prompt. Leave blank when providing only media inputs.
max_tiles (integer, default: 12, range: 1 to 48): Maximum number of temporal tiles to sample when a video is provided.
temperature (number, default: 0.7, range: 0 to 2): Sampling temperature. Set to 0 for greedy decoding.
system_prompt (string, default: empty): Optional system prompt. When audio output is enabled and this is left blank, a recommended prompt is injected automatically.
max_new_tokens (integer, default: 512, range: 32 to 2048): Maximum number of tokens to generate.
frames_per_tile (integer, default: 8, range: 1 to 32): Number of frames sampled per tile when processing video.
enable_audio_output (boolean, default: false): Return generated speech along with text output.
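The defaults and ranges listed here can be checked on the client before a prediction is submitted. The sketch below is one possible way to do that in plain Python. `build_input`, `DEFAULTS`, `RANGES`, and `MEDIA_KEYS` are hypothetical names, the example image URL is a placeholder, and the model may validate inputs differently on the server side.

```python
from typing import Any, Dict

# Defaults and ranges copied from the parameter list on this page.
DEFAULTS: Dict[str, Any] = {
    "seed": 0,
    "top_p": 0.8,
    "prompt": "",
    "max_tiles": 12,
    "temperature": 0.7,
    "system_prompt": "",
    "max_new_tokens": 512,
    "frames_per_tile": 8,
    "enable_audio_output": False,
}
RANGES = {
    "top_p": (0.01, 1),
    "max_tiles": (1, 48),
    "temperature": (0, 2),
    "max_new_tokens": (32, 2048),
    "frames_per_tile": (1, 32),
}
# Media inputs have no default; they are passed through unchanged.
MEDIA_KEYS = {"audio", "video", "images"}


def build_input(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and reject out-of-range values."""
    unknown = set(overrides) - set(DEFAULTS) - MEDIA_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        if not low <= payload[name] <= high:
            raise ValueError(f"{name}={payload[name]} is outside [{low}, {high}]")
    return payload


payload = build_input(
    images=["https://example.com/photo.jpeg"],  # placeholder URL
    enable_audio_output=True,
)
```

The resulting dictionary has the same shape as the "All Input Parameters" JSON and could be passed as the input of a prediction request.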

Output Schema

Example Execution Logs

2025-11-06 21:45:31,604 - INFO - audio num: 1
2025-11-06 21:45:31,605 - INFO - audio_token_num: 69
2025-11-06 21:45:31,605 - INFO - image num: 1
2025-11-06 21:45:31,605 - INFO - dynamic ViT batch size: 13
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
2025-11-06 21:45:32,437 - INFO - get fill token, need to append more text token
2025-11-06 21:45:32,437 - INFO - append 5 text token
2025-11-06 21:45:32,690 - INFO - fill_token index 25 next fill_token index 51
2025-11-06 21:45:32,690 - INFO - get fill token, need to append more text token
2025-11-06 21:45:32,690 - INFO - append 5 text token
2025-11-06 21:45:32,938 - INFO - fill_token index 51 next fill_token index 77
2025-11-06 21:45:32,938 - INFO - get fill token, need to append more text token
2025-11-06 21:45:32,938 - INFO - append 5 text token
2025-11-06 21:45:33,185 - INFO - fill_token index 77 next fill_token index 103
2025-11-06 21:45:33,185 - INFO - get fill token, need to append more text token
2025-11-06 21:45:33,185 - INFO - append 5 text token
2025-11-06 21:45:33,433 - INFO - fill_token index 103 next fill_token index 129
2025-11-06 21:45:33,433 - INFO - get fill token, need to append more text token
2025-11-06 21:45:33,433 - INFO - append 5 text token
2025-11-06 21:45:33,683 - INFO - fill_token index 129 next fill_token index 155
2025-11-06 21:45:33,683 - INFO - no more text token, decode until met eos
2025-11-06 21:45:34,875 - INFO - query: "<|im_start|>system\nYou are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech.<|im_end|>\n<|im_start|>user\n<image>\n<audio>\n<|im_end|>\n<|im_start|>assistant\n"
2025-11-06 21:45:34,875 - INFO - response: A samurai in traditional armor sits contemplatively beneath a blooming cherry blossom tree, surrounded by a serene blue palette.
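The logged `query` shows the prompt layout: a system turn, a user turn holding one placeholder per media input, and an open assistant turn. The sketch below rebuilds that string. `build_query` and `LOGGED_SYSTEM_PROMPT` are hypothetical names; the ordering of placeholders (images before audio) and the placement of user text after them are assumptions, since the log only confirms the case of one image, one audio clip, and no text.

```python
# System prompt copied verbatim from the logged query.
LOGGED_SYSTEM_PROMPT = (
    "You are a highly advanced multimodal conversational AI designed for "
    "human-like interaction. You can perceive auditory, visual, speech, and "
    "textual inputs, and generate text and speech."
)


def build_query(system_prompt: str, num_images: int = 0,
                num_audio: int = 0, prompt: str = "") -> str:
    """Assemble a chat prompt in the layout seen in the execution log."""
    # One placeholder line per media input, followed by any user text (assumed).
    user = "<image>\n" * num_images + "<audio>\n" * num_audio + prompt
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


query = build_query(LOGGED_SYSTEM_PROMPT, num_images=1, num_audio=1)
```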

Version Details

Version ID: 6d19412f763aa2b82d67174fcbb02ca3385740dae88d9ccbded1997f59e75d2f
Version Created: November 6, 2025
