resemble-ai/chatterbox-turbo 🔢📝❓🖼️ → 🖼️

⭐ Official ▶️ 29.4K runs 📅 Dec 2025 ⚙️ Cog 0.16.9
text-to-speech voice-cloning

About

The fastest open-source TTS model, without sacrificing quality.

Example Output

Performance Metrics

3.00s Prediction Time
3.01s Total Time
All Input Parameters
{
  "text": "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?",
  "top_p": 0.95,
  "voice": "Abigail",
  "temperature": 0.8,
  "reference_audio": "https://replicate.delivery/pbxt/OEt67TSAP4l1Aq36bC3DSaRR1oQ5VT5prVkVh0yioOWJBTiO/voice.wav",
  "repetition_penalty": 1.2
}
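As a rough sketch, the input above could be submitted through the Replicate Python client. This assumes the `replicate` package is installed and a `REPLICATE_API_TOKEN` environment variable is set, neither of which is stated on this page. The helper name `run_example` is hypothetical, and the return type of `replicate.run` depends on the client version, so the result is passed back unmodified.

```python
# Input copied from the "All Input Parameters" example above.
EXAMPLE_INPUT = {
    "text": (
        "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. "
        "It's the SkyNet T-800 series and it's got basically everything. "
        "Including AI integration with ChatGPT and all that jazz. "
        "Would you like me to get some prices for you?"
    ),
    "top_p": 0.95,
    "voice": "Abigail",
    "temperature": 0.8,
    "reference_audio": "https://replicate.delivery/pbxt/OEt67TSAP4l1Aq36bC3DSaRR1oQ5VT5prVkVh0yioOWJBTiO/voice.wav",
    "repetition_penalty": 1.2,
}

MODEL = "resemble-ai/chatterbox-turbo"


def run_example(model_input=EXAMPLE_INPUT):
    """Hypothetical helper: run one prediction and return whatever the client returns."""
    # Imported lazily so the payload can be inspected without the package installed.
    import replicate

    # Depending on the client version, the result may be a URL string
    # or a file-like object pointing at the generated audio.
    return replicate.run(MODEL, input=model_input)
```

Calling `run_example()` starts a real prediction against the hosted model. Because `reference_audio` is set in this input, the `voice` value is ignored, as described in the parameter list.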
Input Parameters
seed Type: integer
Random seed for reproducible results. Leave blank for random generation.
text (required) Type: string
Text to synthesize into speech (maximum 500 characters). Supported paralinguistic tags you can include in your text: [clear throat], [sigh], [sush], [cough], [groan], [sniff], [gasp], [chuckle], [laugh]. Example: "Oh, that's hilarious! [chuckle] Let me tell you more."
top_k Type: integer, Default: 1000, Range: 1 - 2000
Top-k sampling. Limits vocabulary to top k tokens at each step.
top_p Type: number, Default: 0.95, Range: 0.5 - 1
Nucleus sampling threshold. Lower values make output more focused.
voice Default: Andy
Pre-made voice to use for synthesis. Ignored if reference_audio is provided.
temperature Type: number, Default: 0.8, Range: 0.05 - 2
Controls randomness in generation. Higher values produce more varied speech.
reference_audio Type: string
Reference audio file for voice cloning (optional). Must be longer than 5 seconds. If provided, overrides the voice selection.
repetition_penalty Type: number, Default: 1.2, Range: 1 - 2
Penalizes token repetition. Higher values reduce repetition.
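The limits listed above can be checked locally before a prediction is submitted. The following is a minimal sketch, not part of the model or the Replicate client: `check_input` is a hypothetical helper, the ranges are assumed to be inclusive, and any bracketed span in the text is assumed to be intended as a tag. Tag names are spelled exactly as they are listed on this page.

```python
import re

MAX_TEXT_CHARS = 500

# Tag names as listed in the description of the text parameter.
SUPPORTED_TAGS = {
    "clear throat", "sigh", "sush", "cough", "groan",
    "sniff", "gasp", "chuckle", "laugh",
}

# Documented (minimum, maximum) per numeric parameter; assumed inclusive.
RANGES = {
    "top_k": (1, 2000),
    "top_p": (0.5, 1.0),
    "temperature": (0.05, 2.0),
    "repetition_penalty": (1.0, 2.0),
}


def check_input(model_input):
    """Return a list of problems; an empty list means no documented limit is violated."""
    problems = []

    text = model_input.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    else:
        if len(text) > MAX_TEXT_CHARS:
            problems.append(f"text has {len(text)} characters, maximum is {MAX_TEXT_CHARS}")
        # Assumption: every [bracketed] span is meant as a paralinguistic tag.
        for tag in re.findall(r"\[([^\[\]]+)\]", text):
            if tag not in SUPPORTED_TAGS:
                problems.append(f"unsupported tag: [{tag}]")

    # Parameters that are absent fall back to the model defaults, so skip them.
    for name, (low, high) in RANGES.items():
        value = model_input.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low} to {high}")

    return problems
```

For example, `check_input({"text": "Hi [yawn]"})` reports the unknown tag, while an input that only sets `text` with supported tags returns an empty list.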
Output Schema

Output

Type: string, Format: uri
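Since the output is a URI string, the generated audio has to be fetched separately. A minimal sketch using only the Python standard library follows. The helper names `filename_from_uri` and `save_output` are hypothetical, and the fallback filename `output.wav` is an assumption, since this page does not state the audio format.

```python
import os
import posixpath
import urllib.parse
import urllib.request


def filename_from_uri(uri, default="output.wav"):
    """Derive a local filename from the last path segment of the URI."""
    path = urllib.parse.urlparse(uri).path
    name = posixpath.basename(path)
    # Fall back to the (assumed) default when the URI has no final segment.
    return name or default


def save_output(uri, directory="."):
    """Download the audio at `uri` into `directory` and return the local path."""
    target = os.path.join(directory, filename_from_uri(uri))
    with urllib.request.urlopen(uri) as response, open(target, "wb") as handle:
        handle.write(response.read())
    return target
```

`save_output` reads the whole response into memory before writing it, which is reasonable for short clips limited to 500 characters of text.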

Example Execution Logs
Using random seed: 56273
Generating audio for text: 'Oh, that's hilarious! [chuckle] Um anyway, we do h...'
Using reference audio: /tmp/tmp9ynu_rvpvoice.wav
  0%|          | 0/1000 [00:00<?, ?it/s]
  1%|▏         | 14/1000 [00:00<00:07, 137.10it/s]
  3%|▎         | 33/1000 [00:00<00:05, 166.12it/s]
  5%|▌         | 52/1000 [00:00<00:05, 175.62it/s]
  7%|▋         | 71/1000 [00:00<00:05, 180.10it/s]
  9%|▉         | 90/1000 [00:00<00:05, 181.86it/s]
 11%|█         | 109/1000 [00:00<00:04, 182.87it/s]
 13%|█▎        | 128/1000 [00:00<00:04, 183.79it/s]
 15%|█▍        | 147/1000 [00:00<00:04, 182.63it/s]
 17%|█▋        | 166/1000 [00:00<00:04, 183.39it/s]
 18%|█▊        | 185/1000 [00:01<00:04, 183.15it/s]
 20%|██        | 204/1000 [00:01<00:04, 182.90it/s]
 22%|██▏       | 223/1000 [00:01<00:04, 183.64it/s]
 24%|██▍       | 242/1000 [00:01<00:04, 184.06it/s]
 26%|██▌       | 261/1000 [00:01<00:04, 184.44it/s]
 28%|██▊       | 280/1000 [00:01<00:03, 183.55it/s]
 30%|██▉       | 299/1000 [00:01<00:03, 183.05it/s]
 32%|███▏      | 318/1000 [00:01<00:03, 182.83it/s]
 34%|███▎      | 337/1000 [00:01<00:03, 182.20it/s]
 36%|███▌      | 356/1000 [00:01<00:03, 181.23it/s]
36%|███▋      | 363/1000 [00:02<00:03, 180.72it/s]
S3 Token -> Mel Inference...
  0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████| 2/2 [00:00<00:00, 25.75it/s]
Audio generation complete.
Version Details
Version ID
95c87b883ff3e842a1643044dff67f9d204f70a80228f24ff64bffe4a4b917d4
Version Created
December 15, 2025