lucataco/qwen2.5-omni-7b

14.1K runs · Apr 2025 · Cog 0.14.3 · GitHub · Paper · License
Tags: speech-to-text, text-generation, text-to-speech, video-to-text

About

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

Example Output

Output

{"text":"Oh, that's a really cool drawing! It looks like a guitar. You've got the body and the neck drawn in a simple yet effective way. The lines are clean and the shape is well-defined. What made you choose to draw a guitar?","voice":"https://replicate.delivery/xezq/xpYIpzUZvRIIONe0qe1CfAcq53wfz46ecayvhcPCRY3mcI7jC/output.wav"}

Performance Metrics

113.05s Prediction Time
219.54s Total Time
All Input Parameters
{
  "video": "https://replicate.delivery/pbxt/MmJqxKbRSknHd9fwtTEbywWqDDdhgsx5tNYLIDnFqJ9j5ObC/draw.mp4",
  "voice_type": "Chelsie",
  "system_prompt": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
  "generate_audio": true,
  "use_audio_in_video": true
}
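To reproduce this prediction, the version pinned below can be called from any Replicate client. The following is a minimal sketch using the official replicate Python client (pip install replicate) with REPLICATE_API_TOKEN set in the environment; the choice of client library is an assumption, not something this page prescribes.

import replicate

# Run the pinned model version with the same inputs shown above;
# replicate.run blocks until the prediction finishes and returns the output.
output = replicate.run(
    "lucataco/qwen2.5-omni-7b:0ca8160f7aaf85703a6aac282d6c79aa64d3541b239fa4c5c1688b10cb1faef1",
    input={
        "video": "https://replicate.delivery/pbxt/MmJqxKbRSknHd9fwtTEbywWqDDdhgsx5tNYLIDnFqJ9j5ObC/draw.mp4",
        "voice_type": "Chelsie",
        "generate_audio": True,
        "use_audio_in_video": True,
    },
)
print(output)  # expected shape: {"text": "...", "voice": "https://...output.wav"}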
Input Parameters
audio Type: string
Optional audio input
image Type: string
Optional image input
video Type: string
Optional video input
prompt Type: string
Text prompt for the model
voice_type Type: string Default: Chelsie
Voice type for audio output
system_prompt Type: string Default: You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.
System prompt for the model
generate_audio Type: boolean Default: true
Whether to generate audio output
use_audio_in_video Type: boolean Default: true
Whether to use the audio track of the video input
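All modality inputs are optional and can be combined with a text prompt. Below is a hedged sketch of an image-plus-prompt call; the image URL and prompt are hypothetical placeholders, not values from this page.

# Hypothetical image + prompt call; reuses the pinned version ID from above.
output = replicate.run(
    "lucataco/qwen2.5-omni-7b:0ca8160f7aaf85703a6aac282d6c79aa64d3541b239fa4c5c1688b10cb1faef1",
    input={
        "image": "https://example.com/drawing.png",  # placeholder URL
        "prompt": "What instrument is shown in this drawing?",  # placeholder prompt
        "generate_audio": False,  # skip speech synthesis for a text-only reply
    },
)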
Output Schema
text Type: string
The generated text response
voice Type: string Format: uri
URL of the generated speech audio file
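A minimal sketch of consuming this schema, assuming output is the dictionary returned by replicate.run above; newer versions of the replicate client may return voice as a FileOutput object rather than a plain string, so it is coerced with str() here.

import urllib.request

print(output["text"])  # the generated text response
voice = output.get("voice")
if voice:  # present only when generate_audio is true
    urllib.request.urlretrieve(str(voice), "response.wav")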
Example Execution Logs
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/qwen_omni_utils/v2_5/audio_process.py:50: UserWarning: PySoundFile failed. Trying audioread instead.
audios.append(librosa.load(path, sr=16000)[0])
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
Deprecated as of librosa version 0.10.0.
It will be removed in librosa version 1.0.
y, sr_native = __audioread_load(path, offset, duration, dtype)
qwen-vl-utils using decord to read video.
Setting `pad_token_id` to `eos_token_id`:8292 for open-end generation.
Version Details
Version ID
0ca8160f7aaf85703a6aac282d6c79aa64d3541b239fa4c5c1688b10cb1faef1
Version Created
April 4, 2025