resemble-ai/chatterbox-turbo

Official · 548.4K runs · Dec 2025 · Cog 0.16.9

text-to-speech voice-cloning

Performance

3.0s typical run time

548.4K total runs

About

The fastest open-source TTS model without sacrificing quality.

Example Output

Performance Metrics

3.00s Prediction Time

3.01s Total Time

All Input Parameters

{
  "text": "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?",
  "top_p": 0.95,
  "voice": "Abigail",
  "temperature": 0.8,
  "reference_audio": "https://replicate.delivery/pbxt/OEt67TSAP4l1Aq36bC3DSaRR1oQ5VT5prVkVh0yioOWJBTiO/voice.wav",
  "repetition_penalty": 1.2
}
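
The input above could be submitted through Replicate's Python client roughly as follows. This is a minimal sketch, not the listing's own code: it assumes the third-party `replicate` package is installed and that `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_example_input` and `synthesize` are hypothetical, and calling the model by name without pinning a version is an assumption.

```python
MODEL = "resemble-ai/chatterbox-turbo"


def build_example_input() -> dict:
    """Return the example payload shown in this listing."""
    return {
        "text": (
            "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model "
            "in store. It's the SkyNet T-800 series and it's got basically "
            "everything. Including AI integration with ChatGPT and all that "
            "jazz. Would you like me to get some prices for you?"
        ),
        "top_p": 0.95,
        "voice": "Abigail",
        "temperature": 0.8,
        "reference_audio": (
            "https://replicate.delivery/pbxt/"
            "OEt67TSAP4l1Aq36bC3DSaRR1oQ5VT5prVkVh0yioOWJBTiO/voice.wav"
        ),
        "repetition_penalty": 1.2,
    }


def synthesize(payload: dict):
    """Run the model once and return whatever the client returns.

    The import is local so this module loads even when the `replicate`
    package is not installed (assumption: it is installed when called).
    """
    import replicate  # third-party client; reads REPLICATE_API_TOKEN

    return replicate.run(MODEL, input=payload)


# Usage (requires network access and an API token):
#   output = synthesize(build_example_input())
#   print(output)
```

Because `reference_audio` is set here, the `voice` value is ignored according to the parameter descriptions in this listing.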

Input Parameters

seed (integer): Random seed for reproducible results. Leave blank for random generation.
text (string, required): Text to synthesize into speech (maximum 500 characters). Supported paralinguistic tags you can include in your text: [clear throat], [sigh], [sush], [cough], [groan], [sniff], [gasp], [chuckle], [laugh]. Example: "Oh, that's hilarious! [chuckle] Let me tell you more."
top_k (integer, default 1000, range 1 to 2000): Top-k sampling. Limits the vocabulary to the top k tokens at each step.
top_p (number, default 0.95, range 0.5 to 1): Nucleus sampling threshold. Lower values make output more focused.
voice (default Andy): Pre-made voice to use for synthesis. Ignored if reference_audio is provided.
temperature (number, default 0.8, range 0.05 to 2): Controls randomness in generation. Higher values produce more varied speech.
reference_audio (string): Reference audio file for voice cloning (optional). Must be longer than 5 seconds. If provided, overrides the voice selection.
repetition_penalty (number, default 1.2, range 1 to 2): Penalizes token repetition. Higher values reduce repetition.
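
The limits listed above can be checked on the client before a request is sent. The sketch below is illustrative only: `check_input` is a hypothetical helper, treating the ranges as inclusive is an assumption, and the listing does not say how the model handles a bracketed tag outside the supported list, so the helper merely flags it. Tag spellings are copied verbatim from the listing.

```python
import re

# Paralinguistic tags, spelled exactly as in the listing.
SUPPORTED_TAGS = {
    "clear throat", "sigh", "sush", "cough", "groan",
    "sniff", "gasp", "chuckle", "laugh",
}

# Documented (min, max) per numeric parameter; inclusive bounds are assumed.
RANGES = {
    "top_k": (1, 2000),
    "top_p": (0.5, 1.0),
    "temperature": (0.05, 2.0),
    "repetition_penalty": (1.0, 2.0),
}

MAX_TEXT_CHARS = 500


def check_input(payload: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = []

    text = payload.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    else:
        if len(text) > MAX_TEXT_CHARS:
            problems.append(f"text has {len(text)} characters (max {MAX_TEXT_CHARS})")
        # Flag any [bracketed] token that is not a listed tag.
        for tag in re.findall(r"\[([^\[\]]+)\]", text):
            if tag not in SUPPORTED_TAGS:
                problems.append(f"unsupported tag: [{tag}]")

    for name, (low, high) in RANGES.items():
        if name in payload and not low <= payload[name] <= high:
            problems.append(f"{name}={payload[name]} outside {low}-{high}")

    return problems
```

For example, `check_input({"text": "Hi [chuckle]", "top_p": 0.95})` returns an empty list, while a `temperature` of 3 or a 501-character text each add one entry.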

Output Schema

Output

Type: string • Format: uri
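
Since the output is a URI rather than raw audio, a caller typically has to fetch it. The following is a sketch using only the Python standard library, assuming the output arrives as a plain URI string as the schema states (some client versions may wrap it in a file-like object instead). `save_output` is a hypothetical helper, and the fallback file name `output.wav` is an assumption, since the listing does not state the audio format.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def save_output(uri: str, dest_dir: str = ".") -> Path:
    """Download the file at `uri` into `dest_dir` and return its path."""
    # Reuse the last path segment of the URI as the file name.
    # "output.wav" is only a guessed fallback (format not documented).
    name = Path(urlparse(uri).path).name or "output.wav"
    dest = Path(dest_dir) / name

    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(uri) as response, open(dest, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return dest


# Usage, with `output_uri` standing in for the string returned by a run:
#   path = save_output(output_uri, dest_dir="audio")
```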

Example Execution Logs

Using random seed: 56273
Generating audio for text: 'Oh, that's hilarious! [chuckle] Um anyway, we do h...'
Using reference audio: /tmp/tmp9ynu_rvpvoice.wav
  0%|          | 0/1000 [00:00<?, ?it/s]
  1%|▏         | 14/1000 [00:00<00:07, 137.10it/s]
  3%|▎         | 33/1000 [00:00<00:05, 166.12it/s]
  5%|▌         | 52/1000 [00:00<00:05, 175.62it/s]
  7%|▋         | 71/1000 [00:00<00:05, 180.10it/s]
  9%|▉         | 90/1000 [00:00<00:05, 181.86it/s]
 11%|█         | 109/1000 [00:00<00:04, 182.87it/s]
 13%|█▎        | 128/1000 [00:00<00:04, 183.79it/s]
 15%|█▍        | 147/1000 [00:00<00:04, 182.63it/s]
 17%|█▋        | 166/1000 [00:00<00:04, 183.39it/s]
 18%|█▊        | 185/1000 [00:01<00:04, 183.15it/s]
 20%|██        | 204/1000 [00:01<00:04, 182.90it/s]
 22%|██▏       | 223/1000 [00:01<00:04, 183.64it/s]
 24%|██▍       | 242/1000 [00:01<00:04, 184.06it/s]
 26%|██▌       | 261/1000 [00:01<00:04, 184.44it/s]
 28%|██▊       | 280/1000 [00:01<00:03, 183.55it/s]
 30%|██▉       | 299/1000 [00:01<00:03, 183.05it/s]
 32%|███▏      | 318/1000 [00:01<00:03, 182.83it/s]
 34%|███▎      | 337/1000 [00:01<00:03, 182.20it/s]
 36%|███▌      | 356/1000 [00:01<00:03, 181.23it/s]
36%|███▋      | 363/1000 [00:02<00:03, 180.72it/s]
S3 Token -> Mel Inference...
  0%|          | 0/2 [00:00<?, ?it/s]
100%|██████████| 2/2 [00:00<00:00, 25.75it/s]
Audio generation complete.

Version Details

Version ID: 95c87b883ff3e842a1643044dff67f9d204f70a80228f24ff64bffe4a4b917d4
Version Created: December 15, 2025
