zsxkib/voxtral ❓🖼️📝🔢 → 📝
About
Voxtral Mini (3B) + Small (24B) 🎙️ Speech transcription and audio understanding in 8 languages 🧠
Example Output
Prompt:
"What can you tell me about this audio?"
Output
من يتجه لحسم المعركة وما هي أهداف كل من تركيا وقصيد ومن وراءها الولايات المتحدة الأمريكية في هذه المواجهات؟
Performance Metrics
Prediction Time: 2.27s
Total Time: 263.56s
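The gap between the two timings is easier to see with a quick calculation. The sketch below assumes that Total Time covers the whole request from submission to completion, including any queueing and model startup; this page does not state that explicitly.

```python
# Timings copied from the Performance Metrics above (seconds).
prediction_time_s = 2.27
total_time_s = 263.56

# Time spent outside the prediction itself (assumed: queueing, startup, transfer).
overhead_s = round(total_time_s - prediction_time_s, 2)

# Fraction of the total wall-clock time spent actually running the model.
prediction_share = prediction_time_s / total_time_s

print(f"Time outside prediction: {overhead_s}s")             # 261.29s
print(f"Prediction share of total: {prediction_share:.2%}")  # 0.86%
```

For this run, the model itself accounts for under 1% of the total time, so the total is not a good guide to how long a prediction takes once the model is already running.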
All Input Parameters
{
  "mode": "transcription",
  "audio": "https://replicate.delivery/pbxt/NPqAcUKAPImFd6Sva7qzqwdl4UCvCsSlUKJTVPAaovXzJeIQ/arabic_news_report.mp3",
  "prompt": "What can you tell me about this audio?",
  "language": "Auto-detect",
  "max_tokens": 500,
  "model_size": "mini"
}
Input Parameters
- mode: Choose processing mode: 'transcription' converts speech to text, 'understanding' analyzes audio content using prompts.
- audio (required): Audio file to process.
- prompt: Question or instruction for understanding mode (e.g., 'What is the speaker discussing?', 'Summarize this audio'). Ignored in transcription mode.
- language: Audio language. 'Auto-detect' works for most content, or choose a specific language for better accuracy.
- max_tokens: Maximum response length. Higher values allow longer outputs but increase processing time.
- model_size: Model selection: 'mini' (3B) is faster and uses less GPU memory, 'small' (24B) provides higher accuracy for complex audio.
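The parameters above could be assembled and sent with the Replicate Python client roughly as follows. This is a minimal sketch, not the model author's code: `build_voxtral_input` and `run_voxtral` are hypothetical helper names, the default values simply mirror the example run on this page (the model's declared defaults are not listed here), and the call assumes the third-party `replicate` package is installed with `REPLICATE_API_TOKEN` set in the environment.

```python
from typing import Any, Dict

# Version ID as listed under "Version Details" on this page.
VOXTRAL_VERSION = "f5d491cbd58d6b048de5da796a4c6267621147b261cc72f02ebee4f39a94d5c5"
MODEL_REF = f"zsxkib/voxtral:{VOXTRAL_VERSION}"

# Allowed values taken from the parameter descriptions above.
VALID_MODES = {"transcription", "understanding"}
VALID_MODEL_SIZES = {"mini", "small"}


def build_voxtral_input(
    audio: str,
    mode: str = "transcription",      # assumed default, mirrors the example run
    prompt: str = "",
    language: str = "Auto-detect",
    max_tokens: int = 500,
    model_size: str = "mini",
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and build the input payload."""
    if not audio:
        raise ValueError("audio is required")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if model_size not in VALID_MODEL_SIZES:
        raise ValueError(f"model_size must be one of {sorted(VALID_MODEL_SIZES)}")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")

    payload: Dict[str, Any] = {
        "mode": mode,
        "audio": audio,
        "language": language,
        "max_tokens": max_tokens,
        "model_size": model_size,
    }
    # The prompt only matters in understanding mode; send it only when given.
    if prompt:
        payload["prompt"] = prompt
    return payload


def run_voxtral(payload: Dict[str, Any]) -> Any:
    """Hypothetical helper: send the payload to Replicate (makes a network call)."""
    import replicate  # third-party: pip install replicate

    return replicate.run(MODEL_REF, input=payload)
```

A call such as `run_voxtral(build_voxtral_input(audio_url, mode="understanding", prompt="Summarize this audio"))` would then return the model's text output. Validating locally before the request avoids paying for a prediction that the model would reject.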
Output Schema
Output
Example Execution Logs
Using Voxtral Mini (3B) model
Mode: transcription
Auto-detecting language (using English as fallback)
Processing audio for transcription...
Generating transcription...
Transcription completed: 107 characters
Version Details
- Version ID: f5d491cbd58d6b048de5da796a4c6267621147b261cc72f02ebee4f39a94d5c5
- Version Created: July 24, 2025