zsxkib/audio-flamingo-3

โ–ถ๏ธ 3.0K runs ๐Ÿ“… Jul 2025 โš™๏ธ Cog 0.15.10 ๐Ÿ”— GitHub ๐Ÿ“„ Paper โš–๏ธ License
audio-analysis audio-to-text music-understanding speech-to-text

About

Advanced audio understanding with step-by-step reasoning

Example Output

Prompt:

"Answer"

Output

Some famous actors who started their careers on Broadway include Meryl Streep, Tom Hanks, and Julia St. Louis-Dreyfus.

Performance Metrics

2.51s Prediction Time
77.06s Total Time
All Input Parameters
{
  "audio": "https://replicate.delivery/pbxt/NNbVqHojUTdMWRv0siI8RhfccA3lbRlk4Lu1oSWmSHJ8t6xW/voice_0.mp3",
  "prompt": "Answer",
  "max_length": 0,
  "temperature": 0,
  "system_prompt": "",
  "enable_thinking": true
}
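The same request can be sketched with the Replicate Python client. This is a minimal sketch rather than an official snippet: it assumes the `replicate` package is installed (`pip install replicate`) and that `REPLICATE_API_TOKEN` is set in the environment. `MODEL_REF`, `EXAMPLE_INPUT`, and `run_example` are illustrative names; the model reference is assembled from the model name and the Version ID listed under Version Details.

```python
import json

# Model reference assembled from this page: owner/name plus the Version ID.
MODEL_REF = (
    "zsxkib/audio-flamingo-3:"
    "419bdd5ed04ba4e4609e66cc5082f6564e9d2c0836f9a286abe74bc20a357b84"
)

# The example input shown above, expressed as a Python dict.
EXAMPLE_INPUT = {
    "audio": "https://replicate.delivery/pbxt/NNbVqHojUTdMWRv0siI8RhfccA3lbRlk4Lu1oSWmSHJ8t6xW/voice_0.mp3",
    "prompt": "Answer",
    "max_length": 0,
    "temperature": 0,
    "system_prompt": "",
    "enable_thinking": True,
}


def run_example():
    """Send EXAMPLE_INPUT to the model.

    Needs network access and REPLICATE_API_TOKEN; not called automatically.
    """
    # Imported lazily so the payload can be inspected without the client installed.
    import replicate  # assumption: installed via `pip install replicate`

    return replicate.run(MODEL_REF, input=EXAMPLE_INPUT)


if __name__ == "__main__":
    # Print the payload only; call run_example() explicitly to hit the API.
    print(json.dumps(EXAMPLE_INPUT, indent=2))
```

Per the Output Schema, the result is expected to be text; depending on the client version it may arrive as a single string or as an iterable of string chunks, so joining the chunks is a reasonable precaution.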
Input Parameters
audio (required) Type: string
Audio file to analyze. Supports speech, music, and sound effects. Maximum duration: 10 minutes.
prompt Type: string | Default: Please describe this audio in detail.
Question or instruction about the audio
end_time Type: number
End time in seconds for audio segment analysis (optional). Must be greater than start_time.
max_length Type: integer | Default: 0 | Range: 0 - 2048
Maximum length of the response in tokens. Use 0 for model default, or specify 50-2048 for custom length.
start_time Type: number
Start time in seconds for audio segment analysis (optional). Useful for long audio files.
temperature Type: number | Default: 0 | Range: 0 - 1
Controls response creativity and randomness. Use 0.0 for deterministic (default), 0.1-0.3 for factual analysis, 0.7-0.9 for creative interpretation.
system_prompt Type: string | Default: (empty)
System instructions to customize the model's behavior, output format, or analysis style. Leave empty for default behavior.
enable_thinking Type: boolean | Default: false
Enable detailed chain-of-thought reasoning for complex analysis. False for faster responses, True for deeper insights.
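The constraints listed above can be checked on the client before a request is sent. The helper below is a hypothetical sketch (`build_input` is not part of the model or of any Replicate library): it applies the documented defaults and ranges and omits `start_time` and `end_time` when they are not supplied. Whether the server enforces exactly these rules is an assumption.

```python
from typing import Any, Dict, Optional


def build_input(
    audio: str,
    prompt: str = "Please describe this audio in detail.",
    *,
    start_time: Optional[float] = None,
    end_time: Optional[float] = None,
    max_length: int = 0,
    temperature: float = 0.0,
    system_prompt: str = "",
    enable_thinking: bool = False,
) -> Dict[str, Any]:
    """Build an input payload, checking the documented parameter constraints."""
    if not audio:
        raise ValueError("audio is required")
    # Documented range is 0 - 2048; the description suggests 0 or 50-2048.
    if not 0 <= max_length <= 2048:
        raise ValueError("max_length must be between 0 and 2048")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    # Only compared when both are given; the page does not say what
    # start_time defaults to when it is omitted.
    if start_time is not None and end_time is not None and end_time <= start_time:
        raise ValueError("end_time must be greater than start_time")

    payload: Dict[str, Any] = {
        "audio": audio,
        "prompt": prompt,
        "max_length": max_length,
        "temperature": temperature,
        "system_prompt": system_prompt,
        "enable_thinking": enable_thinking,
    }
    # Optional segment bounds are left out entirely when not set.
    if start_time is not None:
        payload["start_time"] = start_time
    if end_time is not None:
        payload["end_time"] = end_time
    return payload
```

For example, `build_input("https://example.com/talk.mp3", start_time=60, end_time=120, enable_thinking=True)` would describe an analysis of the second minute of a (placeholder) recording with chain-of-thought reasoning enabled.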
Output Schema

Output

Type: string

Example Execution Logs
{'from': 'human', 'value': [<llava.media.Sound object at 0x77675cddea70>, '<sound>\nAnswer']}
2025-07-18 14:26:16.848 | WARNING  | llava.utils.media:extract_media:262 - Media token '<sound>' found in text: '<sound>
Answer'. Removed.
/src/llava/mm_utils.py:592: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
masks = torch.tensor(masks[0])
Version Details
Version ID
419bdd5ed04ba4e4609e66cc5082f6564e9d2c0836f9a286abe74bc20a357b84
Version Created
July 18, 2025