resemble-ai/chatterbox-multilingual

2.3K runs, Sep 2025, Cog 0.16.2
Tags: multilingual, multilingual-tts, text-to-speech, voice-cloning

About

Generate expressive, natural speech in 23 languages. Features instant voice cloning from short audio, emotion control, and seamless cross-language voice transfer.

Example Output

Performance Metrics

4.86s Prediction Time
4.87s Total Time
All Input Parameters
{
  "seed": 0,
  "text": "Replicate est une entreprise véritablement impressionnante. Elle se distingue par sa capacité à rendre accessibles des modèles d’intelligence artificielle puissants à un grand nombre d’utilisateurs, qu’ils soient chercheurs, développeurs ou créateurs.",
  "language": "fr",
  "cfg_weight": 0.5,
  "temperature": 0.8,
  "exaggeration": 0.5,
  "reference_audio": "https://replicate.delivery/pbxt/Nefzy5KEvtNzIKfBX2O0GDdplN0IWaa7cfZxMAJ3hc4a4JEc/French_Female.mp3"
}
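
The example input above could be submitted from Python roughly as follows. This is a minimal sketch that assumes the third-party `replicate` client package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The names `EXAMPLE_INPUT`, `model_ref`, and `synthesize` are illustrative and do not come from the model page.

```python
# Model name and version ID as listed on this page.
MODEL = "resemble-ai/chatterbox-multilingual"
VERSION = "9cfba4c265e685f840612be835424f8c33bdee685d7466ece7684b0d9d4c0b1c"

# The example input shown above, copied as a Python dict.
EXAMPLE_INPUT = {
    "seed": 0,
    "text": (
        "Replicate est une entreprise véritablement impressionnante. "
        "Elle se distingue par sa capacité à rendre accessibles des modèles "
        "d’intelligence artificielle puissants à un grand nombre "
        "d’utilisateurs, qu’ils soient chercheurs, développeurs ou créateurs."
    ),
    "language": "fr",
    "cfg_weight": 0.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
    "reference_audio": (
        "https://replicate.delivery/pbxt/"
        "Nefzy5KEvtNzIKfBX2O0GDdplN0IWaa7cfZxMAJ3hc4a4JEc/French_Female.mp3"
    ),
}


def model_ref(model=MODEL, version=VERSION):
    """Build the "owner/name:version" reference used to pin a version."""
    return f"{model}:{version}"


def synthesize(inputs):
    """Run one prediction (hypothetical helper; needs network access and a token)."""
    import replicate  # imported lazily so the rest of the sketch has no dependencies

    # Depending on the client version, the result may be a URL string
    # or a file-like object wrapping that URL.
    return replicate.run(model_ref(), input=inputs)
```

Calling `synthesize(EXAMPLE_INPUT)` would start a prediction; pinning the version ID keeps results tied to the version documented here.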
Input Parameters
seed Type: integer, Default: 0
Random seed for reproducible results (0 for random generation)
text (required) Type: string
Text to synthesize into speech (maximum 300 characters)
language Default: en
Language code for synthesis (for example, en or fr)
cfg_weight Type: number, Default: 0.5, Range: 0.2 - 1
CFG/pace weight controlling generation guidance (0.2-1.0). Use 0.5 for balanced results; lower values are suggested for language transfer (the description mentions 0, but the declared minimum is 0.2)
temperature Type: number, Default: 0.8, Range: 0.05 - 5
Controls randomness in generation (0.05-5.0, higher=more varied)
exaggeration Type: number, Default: 0.5, Range: 0.25 - 2
Controls speech expressiveness (0.25-2.0, neutral=0.5, extreme values may be unstable)
reference_audio Type: string
Reference audio file for voice cloning (optional). If not provided, uses default voice for the selected language.
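
The defaults and ranges listed above can be checked on the client before a request is sent. The sketch below is one way to do that, not the model's own validation: `build_input`, `DEFAULTS`, and `LIMITS` are hypothetical names, and it assumes the 300-character limit counts Python string characters and that the declared ranges are inclusive.

```python
# Defaults and ranges transcribed from the parameter list above.
DEFAULTS = {
    "seed": 0,
    "language": "en",
    "cfg_weight": 0.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
}
LIMITS = {
    "cfg_weight": (0.2, 1.0),
    "temperature": (0.05, 5.0),
    "exaggeration": (0.25, 2.0),
}
MAX_TEXT_CHARS = 300  # assumption: the limit is measured in characters
OPTIONAL_KEYS = {"reference_audio"}


def build_input(text, **overrides):
    """Return an input dict with defaults applied, or raise ValueError."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text has {len(text)} characters, limit is {MAX_TEXT_CHARS}")

    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    payload = {**DEFAULTS, **overrides, "text": text}

    # Reject numeric values outside the declared (assumed inclusive) ranges.
    for name, (low, high) in LIMITS.items():
        if not low <= payload[name] <= high:
            raise ValueError(f"{name}={payload[name]} is outside {low}-{high}")
    return payload
```

For example, `build_input("Bonjour le monde", language="fr")` returns a dict with the remaining defaults filled in, and `reference_audio` is simply omitted when no voice sample is supplied.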
Output Schema

Output

Type: string, Format: uri
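
Because the output is a URI string, the generated audio has to be fetched separately. A minimal sketch using only the standard library is shown below; `save_output` is a hypothetical helper, and since this page does not state the audio container format, the destination filename is left to the caller.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output(uri, dest):
    """Stream the resource at `uri` into the file `dest` and return its path."""
    dest = Path(dest)
    # str() also covers client versions that return a URL-like object.
    with urllib.request.urlopen(str(uri)) as response, dest.open("wb") as fh:
        shutil.copyfileobj(response, fh)
    return dest
```

Streaming with `shutil.copyfileobj` avoids holding the whole audio file in memory.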

Example Execution Logs
🗣️  Generating audio for text: 'Replicate est une entreprise véritablement impress...'
🌍 Language: French (fr)
📎 Using audio prompt: /tmp/tmpngbn07ndFrench_Female.mp3
Sampling:   0%|          | 0/1000 [00:00<?, ?it/s]
Sampling:   0%|          | 3/1000 [00:00<00:35, 28.24it/s]
Sampling:   1%|▏         | 13/1000 [00:00<00:14, 66.57it/s]
Sampling:   2%|▏         | 23/1000 [00:00<00:12, 78.66it/s]
Sampling:   3%|▎         | 33/1000 [00:00<00:11, 84.00it/s]
Sampling:   4%|▍         | 43/1000 [00:00<00:10, 87.24it/s]
Sampling:   5%|▌         | 53/1000 [00:00<00:10, 89.20it/s]
Sampling:   6%|▋         | 63/1000 [00:00<00:10, 90.48it/s]
Sampling:   7%|▋         | 73/1000 [00:00<00:10, 91.46it/s]
Sampling:   8%|▊         | 83/1000 [00:00<00:09, 92.03it/s]
Sampling:   9%|▉         | 93/1000 [00:01<00:09, 92.42it/s]
Sampling:  10%|█         | 103/1000 [00:01<00:09, 91.86it/s]
Sampling:  11%|█▏        | 113/1000 [00:01<00:09, 92.25it/s]
Sampling:  12%|█▏        | 123/1000 [00:01<00:09, 92.37it/s]
Sampling:  13%|█▎        | 133/1000 [00:01<00:09, 92.45it/s]
Sampling:  14%|█▍        | 143/1000 [00:01<00:09, 91.21it/s]
Sampling:  15%|█▌        | 153/1000 [00:01<00:09, 90.96it/s]
Sampling:  16%|█▋        | 163/1000 [00:01<00:09, 90.32it/s]
Sampling:  17%|█▋        | 173/1000 [00:01<00:09, 90.77it/s]
Sampling:  18%|█▊        | 183/1000 [00:02<00:08, 90.83it/s]
Sampling:  19%|█▉        | 193/1000 [00:02<00:08, 90.72it/s]
Sampling:  20%|██        | 203/1000 [00:02<00:09, 88.08it/s]
Sampling:  21%|██        | 212/1000 [00:02<00:09, 87.54it/s]
Sampling:  22%|██▏       | 221/1000 [00:02<00:08, 87.06it/s]
Sampling:  23%|██▎       | 230/1000 [00:02<00:08, 86.42it/s]
Sampling:  24%|██▍       | 239/1000 [00:02<00:08, 85.07it/s]
Sampling:  25%|██▍       | 248/1000 [00:02<00:08, 85.26it/s]
Sampling:  26%|██▌       | 257/1000 [00:02<00:08, 85.42it/s]
Sampling:  27%|██▋       | 266/1000 [00:03<00:08, 84.61it/s]
Sampling:  28%|██▊       | 275/1000 [00:03<00:08, 84.88it/s]
Sampling:  28%|██▊       | 284/1000 [00:03<00:08, 84.10it/s]
Sampling:  29%|██▉       | 293/1000 [00:03<00:08, 84.54it/s]
Sampling:  30%|███       | 302/1000 [00:03<00:08, 83.75it/s]
Sampling:  31%|███       | 311/1000 [00:03<00:08, 83.95it/s]
Sampling:  32%|███▏      | 320/1000 [00:03<00:08, 84.26it/s]
Sampling:  33%|███▎      | 329/1000 [00:03<00:07, 84.01it/s]
WARNING:chatterbox.models.t3.inference.alignment_stream_analyzer:forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sampling:  33%|███▎      | 332/1000 [00:03<00:07, 86.75it/s]
✅ Audio generation complete.
Version Details
Version ID
9cfba4c265e685f840612be835424f8c33bdee685d7466ece7684b0d9d4c0b1c
Version Created
September 3, 2025