zsxkib/voxtral
About
Voxtral Mini (3B) + Small (24B): speech transcription and audio understanding in 8 languages

Example Output
Prompt:
"What can you tell me about this audio?"
Output
[Arabic transcription of the news report; the text is unreadable here because of a character-encoding error]
Performance Metrics
- Prediction Time: 2.27s
- Total Time: 263.56s
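The gap between the two figures is large, so it can help to work it out explicitly. The sketch below only does the arithmetic on the two values reported above; reading the remainder as queueing or model start-up time is an assumption, since the page does not break the total down.

```python
# Values copied from the Performance Metrics above.
prediction_time_s = 2.27
total_time_s = 263.56

# Time not spent in prediction. What it consists of (queueing, model
# start-up, file transfer) is an assumption; the page does not say.
overhead_s = round(total_time_s - prediction_time_s, 2)

# Fraction of the total wall-clock time spent in prediction.
prediction_share = prediction_time_s / total_time_s

print(f"Non-prediction time: {overhead_s:.2f}s")
print(f"Prediction share of total: {prediction_share:.1%}")
```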
All Input Parameters
{
  "mode": "transcription",
  "audio": "https://replicate.delivery/pbxt/NPqAcUKAPImFd6Sva7qzqwdl4UCvCsSlUKJTVPAaovXzJeIQ/arabic_news_report.mp3",
  "prompt": "What can you tell me about this audio?",
  "language": "Auto-detect",
  "max_tokens": 500,
  "model_size": "mini"
}
Input Parameters
- mode: Choose processing mode: 'transcription' converts speech to text, 'understanding' analyzes audio content using prompts.
- audio (required): Audio file to process.
- prompt: Question or instruction for understanding mode (e.g., 'What is the speaker discussing?', 'Summarize this audio'). Ignored in transcription mode.
- language: Audio language. 'Auto-detect' works for most content, or choose a specific language for better accuracy.
- max_tokens: Maximum response length. Higher values allow longer outputs but increase processing time.
- model_size: Model selection: 'mini' (3B) is faster and uses less GPU memory, 'small' (24B) provides higher accuracy for complex audio.
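The parameters above map onto a single input dictionary. The following is a minimal sketch, assuming the `replicate` Python client is installed and the `REPLICATE_API_TOKEN` environment variable is set. `build_input` and `run_voxtral` are hypothetical helper names, the defaults simply mirror the example run on this page rather than confirmed API defaults, and the validation covers only the values this page names.

```python
# Values named in the parameter descriptions on this page.
ALLOWED_MODES = {"transcription", "understanding"}
ALLOWED_MODEL_SIZES = {"mini", "small"}

# Model reference built from the Version ID listed under "Version Details".
MODEL_REF = (
    "zsxkib/voxtral:"
    "f5d491cbd58d6b048de5da796a4c6267621147b261cc72f02ebee4f39a94d5c5"
)


def build_input(audio, mode="transcription", prompt=None,
                language="Auto-detect", max_tokens=500, model_size="mini"):
    """Assemble the input dict (hypothetical helper, not part of any library).

    Defaults mirror the example run on this page; they are an assumption,
    not documented API defaults.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if model_size not in ALLOWED_MODEL_SIZES:
        raise ValueError(
            f"model_size must be one of {sorted(ALLOWED_MODEL_SIZES)}"
        )
    payload = {
        "audio": audio,  # the only required parameter
        "mode": mode,
        "language": language,
        "max_tokens": max_tokens,
        "model_size": model_size,
    }
    if prompt is not None:
        # Per the description above, ignored in transcription mode.
        payload["prompt"] = prompt
    return payload


def run_voxtral(payload):
    """Send the payload to the model (needs network access and an API token)."""
    import replicate  # third-party client, imported lazily
    return replicate.run(MODEL_REF, input=payload)
```

A call for understanding mode might look like `run_voxtral(build_input(audio_url, mode="understanding", prompt="Summarize this audio"))`, where `audio_url` is a placeholder for a reachable audio file URL.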
Output Schema
Output
Example Execution Logs
Using Voxtral Mini (3B) model
Mode: transcription
Auto-detecting language (using English as fallback)
Processing audio for transcription...
Generating transcription...
Transcription completed: 107 characters
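If the reported transcription length is needed programmatically, it can be pulled out of the log text with a regular expression. This is a sketch only: `transcription_length` is a hypothetical helper, and the log wording is assumed from the single example above, so it may differ between runs or versions.

```python
import re

# Log text copied from the example above.
EXAMPLE_LOG = (
    "Using Voxtral Mini (3B) model Mode: transcription "
    "Auto-detecting language (using English as fallback) "
    "Processing audio for transcription... Generating transcription... "
    "Transcription completed: 107 characters"
)


def transcription_length(log_text):
    """Return the reported character count, or None if the line is absent."""
    match = re.search(r"Transcription completed: (\d+) characters", log_text)
    return int(match.group(1)) if match else None


print(transcription_length(EXAMPLE_LOG))
```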
Version Details
- Version ID: f5d491cbd58d6b048de5da796a4c6267621147b261cc72f02ebee4f39a94d5c5
- Version Created: July 24, 2025