zsxkib/audio-flamingo-3 🖼️📝🔢✓ → 📝

▶️ 56.3K runs 📅 Jul 2025 ⚙️ Cog 0.15.10

audio-analysis audio-to-text music-understanding speech-to-text

About

🎧Advanced audio understanding with step-by-step reasoning📣

Example Output

Prompt:

"Answer"

Output

Some famous actors who started their careers on Broadway include Meryl Streep, Tom Hanks, and Julia St. Louis-Dreyfus.

Performance Metrics

2.51s Prediction Time

77.06s Total Time

All Input Parameters

{
  "audio": "https://replicate.delivery/pbxt/NNbVqHojUTdMWRv0siI8RhfccA3lbRlk4Lu1oSWmSHJ8t6xW/voice_0.mp3",
  "prompt": "Answer",
  "max_length": 0,
  "temperature": 0,
  "system_prompt": "",
  "enable_thinking": true
}
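
The payload above could be submitted from Python roughly as follows. This is a minimal sketch, not an official snippet from the model page: it assumes the third-party `replicate` client is installed and that `REPLICATE_API_TOKEN` is set in the environment. The names `MODEL`, `VERSION`, `EXAMPLE_INPUT`, `model_ref`, and `run_example` are illustrative; the version ID is the one listed under Version Details.

```python
import os

# Model name and version ID as listed on this page.
MODEL = "zsxkib/audio-flamingo-3"
VERSION = "419bdd5ed04ba4e4609e66cc5082f6564e9d2c0836f9a286abe74bc20a357b84"

# The example payload from "All Input Parameters", as a Python dict.
EXAMPLE_INPUT = {
    "audio": "https://replicate.delivery/pbxt/NNbVqHojUTdMWRv0siI8RhfccA3lbRlk4Lu1oSWmSHJ8t6xW/voice_0.mp3",
    "prompt": "Answer",
    "max_length": 0,
    "temperature": 0,
    "system_prompt": "",
    "enable_thinking": True,
}


def model_ref(model: str = MODEL, version: str = VERSION) -> str:
    """Build the 'owner/name:version' reference string."""
    return f"{model}:{version}"


def run_example():
    """Send the example payload to the model (needs network and an API token)."""
    # Imported lazily so the rest of this sketch works without the package.
    import replicate  # assumption: installed via `pip install replicate`

    return replicate.run(model_ref(), input=EXAMPLE_INPUT)


if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    # The output schema on this page is a plain string.
    print(run_example())
```

Pinning the version ID keeps results reproducible if the model owner publishes a newer version later.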

Input Parameters

audio (required). Type: string. Audio file to analyze. Supports speech, music, and sound effects. Maximum duration: 10 minutes.
prompt. Type: string. Default: "Please describe this audio in detail." Question or instruction about the audio.
end_time. Type: number. End time in seconds for audio segment analysis (optional). Must be greater than start_time.
max_length. Type: integer. Default: 0. Range: 0 - 2048. Maximum length of the response in tokens. Use 0 for the model default, or specify 50-2048 for a custom length.
start_time. Type: number. Start time in seconds for audio segment analysis (optional). Useful for long audio files.
temperature. Type: number. Default: 0. Range: 0 - 1. Controls response creativity and randomness. Use 0.0 for deterministic output (default), 0.1-0.3 for factual analysis, 0.7-0.9 for creative interpretation.
system_prompt. Type: string. Default: empty. System instructions to customize the model's behavior, output format, or analysis style. Leave empty for default behavior.
enable_thinking. Type: boolean. Default: false. Enable detailed chain-of-thought reasoning for complex analysis. False for faster responses, true for deeper insights.
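
The constraints in the parameter list can be checked client-side before a request is sent. The helper below is a hypothetical sketch (the name `build_input` is not part of the model or of any library); it applies the documented defaults and ranges. Rejecting `max_length` values between 1 and 49 is an interpretation of "0 or 50-2048", not confirmed behavior of the model.

```python
from typing import Any, Dict, Optional


def build_input(
    audio: str,
    prompt: str = "Please describe this audio in detail.",
    max_length: int = 0,
    temperature: float = 0.0,
    system_prompt: str = "",
    enable_thinking: bool = False,
    start_time: Optional[float] = None,
    end_time: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate arguments against the documented ranges and build the payload."""
    if not audio:
        raise ValueError("audio is required")
    # Documented: 0 means model default, otherwise 50-2048 tokens.
    # Assumption: values 1-49 are treated as invalid here.
    if max_length != 0 and not 50 <= max_length <= 2048:
        raise ValueError("max_length must be 0 or between 50 and 2048")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    # Documented: end_time must be greater than start_time.
    if start_time is not None and end_time is not None and end_time <= start_time:
        raise ValueError("end_time must be greater than start_time")

    payload: Dict[str, Any] = {
        "audio": audio,
        "prompt": prompt,
        "max_length": max_length,
        "temperature": temperature,
        "system_prompt": system_prompt,
        "enable_thinking": enable_thinking,
    }
    # The optional segment bounds are only included when set.
    if start_time is not None:
        payload["start_time"] = start_time
    if end_time is not None:
        payload["end_time"] = end_time
    return payload
```

For example, `build_input("clip.mp3", start_time=30, end_time=90)` would produce a payload that asks for analysis of the segment from 30 to 90 seconds, while `build_input("clip.mp3", start_time=90, end_time=30)` raises `ValueError`.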

Output Schema

Output

Type: string

Example Execution Logs

{'from': 'human', 'value': [<llava.media.Sound object at 0x77675cddea70>, '<sound>\nAnswer']}
2025-07-18 14:26:16.848 | WARNING  | llava.utils.media:extract_media:262 - Media token '<sound>' found in text: '<sound>
Answer'. Removed.
/src/llava/mm_utils.py:592: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
masks = torch.tensor(masks[0])

Version Details

Version ID: 419bdd5ed04ba4e4609e66cc5082f6564e9d2c0836f9a286abe74bc20a357b84
Version Created: July 18, 2025
