resemble-ai/chatterbox-multilingual
About
Generate expressive, natural speech in 23 languages. Features instant voice cloning from short audio, emotion control, and seamless cross-language voice transfer.

Performance Metrics
- Prediction Time: 4.86s
- Total Time: 4.87s
All Input Parameters
{
  "seed": 0,
  "text": "Replicate est une entreprise véritablement impressionnante. Elle se distingue par sa capacité à rendre accessibles des modèles d’intelligence artificielle puissants à un grand nombre d’utilisateurs, qu’ils soient chercheurs, développeurs ou créateurs.",
  "language": "fr",
  "cfg_weight": 0.5,
  "temperature": 0.8,
  "exaggeration": 0.5,
  "reference_audio": "https://replicate.delivery/pbxt/Nefzy5KEvtNzIKfBX2O0GDdplN0IWaa7cfZxMAJ3hc4a4JEc/French_Female.mp3"
}
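The example input above could be submitted from Python roughly as follows. This is a minimal sketch, assuming the third-party `replicate` client is installed and `REPLICATE_API_TOKEN` is set in the environment. The helper `run_example` is a hypothetical name, the text is shortened to the first sentence of the example, and the handling of the return value (a file-like object or a URL string) is an assumption, since this page does not describe the output type.

```python
import urllib.request

# Model reference built from the model name and Version ID shown on this page.
MODEL_REF = (
    "resemble-ai/chatterbox-multilingual:"
    "9cfba4c265e685f840612be835424f8c33bdee685d7466ece7684b0d9d4c0b1c"
)

# Values copied from the example input; the text is shortened to its first sentence.
EXAMPLE_INPUT = {
    "seed": 0,
    "text": "Replicate est une entreprise véritablement impressionnante.",
    "language": "fr",
    "cfg_weight": 0.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
    "reference_audio": (
        "https://replicate.delivery/pbxt/"
        "Nefzy5KEvtNzIKfBX2O0GDdplN0IWaa7cfZxMAJ3hc4a4JEc/French_Female.mp3"
    ),
}


def run_example(out_path: str) -> str:
    """Hypothetical helper: run one prediction and save the audio to out_path."""
    # Imported lazily so the module loads even without the client installed.
    import replicate  # pip install replicate

    output = replicate.run(MODEL_REF, input=EXAMPLE_INPUT)

    # Assumption: the client returns either a file-like object or a URL string.
    if hasattr(output, "read"):
        data = output.read()
    else:
        with urllib.request.urlopen(str(output)) as resp:
            data = resp.read()

    with open(out_path, "wb") as fh:
        fh.write(data)
    return out_path
```

Calling `run_example("speech_output")` would then write the generated audio to that path. The audio container format is not stated on this page, so the file extension is left to the caller.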
Input Parameters
- seed: Random seed for reproducible results (0 for random generation)
- text (required): Text to synthesize into speech (maximum 300 characters)
- language: Language for synthesis
- cfg_weight: CFG/Pace weight controlling generation guidance (0.2-1.0). Use 0.5 for balanced results, 0 for language transfer
- temperature: Controls randomness in generation (0.05-5.0, higher = more varied)
- exaggeration: Controls speech expressiveness (0.25-2.0, neutral = 0.5, extreme values may be unstable)
- reference_audio: Reference audio file for voice cloning (optional). If not provided, uses the default voice for the selected language.
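The documented limits can be checked client-side before a request is sent. The sketch below is illustrative only: `split_text` and `build_input` are hypothetical helpers, not part of the model or any client library. The default values mirror the example input on this page, not confirmed model defaults, and accepting `cfg_weight = 0` alongside the 0.2-1.0 range is an interpretation of the "0 for language transfer" note.

```python
import re

MAX_TEXT_CHARS = 300  # documented maximum length of `text`


def split_text(text, limit=MAX_TEXT_CHARS):
    """Greedily pack sentences into chunks of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is hard-split (this may cut mid-word).
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def build_input(text, language, seed=0, cfg_weight=0.5, temperature=0.8,
                exaggeration=0.5, reference_audio=None):
    """Validate values against the documented ranges and return an input dict."""
    if not text or len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text must be 1-{MAX_TEXT_CHARS} characters")
    # Assumption: 0 is allowed in addition to the 0.2-1.0 range.
    if not (cfg_weight == 0 or 0.2 <= cfg_weight <= 1.0):
        raise ValueError("cfg_weight must be 0 or within 0.2-1.0")
    if not 0.05 <= temperature <= 5.0:
        raise ValueError("temperature must be within 0.05-5.0")
    if not 0.25 <= exaggeration <= 2.0:
        raise ValueError("exaggeration must be within 0.25-2.0")

    payload = {
        "seed": seed,
        "text": text,
        "language": language,
        "cfg_weight": cfg_weight,
        "temperature": temperature,
        "exaggeration": exaggeration,
    }
    # reference_audio is optional; omit it to use the default voice.
    if reference_audio is not None:
        payload["reference_audio"] = reference_audio
    return payload
```

For text longer than 300 characters, one option is to build one input per chunk, for example `[build_input(chunk, "fr") for chunk in split_text(long_text)]`, and concatenate the resulting audio clips afterwards.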
Example Execution Logs
🗣️ Generating audio for text: 'Replicate est une entreprise véritablement impress...'
🌍 Language: French (fr)
📎 Using audio prompt: /tmp/tmpngbn07ndFrench_Female.mp3
Sampling: 0%| | 0/1000 [00:00<?, ?it/s]
Sampling: 33%|███▎ | 329/1000 [00:03<00:07, 84.01it/s]
WARNING:chatterbox.models.t3.inference.alignment_stream_analyzer:forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sampling: 33%|███▎ | 332/1000 [00:03<00:07, 86.75it/s]
✅ Audio generation complete.
Version Details
- Version ID: 9cfba4c265e685f840612be835424f8c33bdee685d7466ece7684b0d9d4c0b1c
- Version Created: September 3, 2025