resemble-ai/chatterbox-turbo
About
The fastest open-source text-to-speech (TTS) model, without sacrificing quality.
Performance Metrics
Prediction Time: 3.00s
Total Time: 3.01s
All Input Parameters
{
"text": "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?",
"top_p": 0.95,
"voice": "Abigail",
"temperature": 0.8,
"reference_audio": "https://replicate.delivery/pbxt/OEt67TSAP4l1Aq36bC3DSaRR1oQ5VT5prVkVh0yioOWJBTiO/voice.wav",
"repetition_penalty": 1.2
}
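The parameters above can be sent to the model from code. The following is a minimal sketch, assuming the third-party `replicate` Python client is installed and a `REPLICATE_API_TOKEN` environment variable is set. The helper names `build_input` and `synthesize` are hypothetical, and the default values are copied from the example above; they are not documented defaults of the model.

```python
# Sketch only: build_input and synthesize are illustrative helpers, not part of any library.
MODEL = "resemble-ai/chatterbox-turbo"
# Version ID as listed under "Version Details" on this page.
VERSION = "95c87b883ff3e842a1643044dff67f9d204f70a80228f24ff64bffe4a4b917d4"


def build_input(text, voice="Abigail", reference_audio=None,
                temperature=0.8, top_p=0.95, repetition_penalty=1.2, seed=None):
    """Assemble the input dictionary using the parameter names from this page."""
    payload = {
        "text": text,
        "voice": voice,
        "temperature": temperature,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
    }
    # reference_audio overrides voice when present, so include it only if given.
    if reference_audio is not None:
        payload["reference_audio"] = reference_audio
    # Omitting the seed leaves the choice of a random seed to the service.
    if seed is not None:
        payload["seed"] = seed
    return payload


def synthesize(payload):
    """Run the model. Requires network access and REPLICATE_API_TOKEN."""
    import replicate  # third-party client: pip install replicate
    return replicate.run(f"{MODEL}:{VERSION}", input=payload)
```

Calling `synthesize(build_input("Oh, that's hilarious! [chuckle]"))` would submit a prediction. The shape of the returned value depends on the client version, so inspect it before assuming a URL or a file-like object.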
Input Parameters
- seed
- Random seed for reproducible results. Leave blank for random generation.
- text (required)
- Text to synthesize into speech (maximum 500 characters). Supported paralinguistic tags you can include in your text: [clear throat], [sigh], [sush], [cough], [groan], [sniff], [gasp], [chuckle], [laugh]. Example: "Oh, that's hilarious! [chuckle] Let me tell you more."
- top_k
- Top-k sampling. Limits vocabulary to top k tokens at each step.
- top_p
- Nucleus sampling threshold. Lower values make output more focused.
- voice
- Pre-made voice to use for synthesis. Ignored if reference_audio is provided.
- temperature
- Controls randomness in generation. Higher values produce more varied speech.
- reference_audio
- Reference audio file for voice cloning (optional). Must be longer than 5 seconds. If provided, overrides the voice selection.
- repetition_penalty
- Penalizes token repetition. Higher values reduce repetition.
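The constraints on `text` listed above (a 500-character limit and a fixed set of bracketed tags) can be checked locally before a request is sent. This is a sketch under assumptions: `check_text` is a hypothetical helper, the tag names are copied verbatim from the list above, and the page does not say how the service treats unknown tags or whether matching is case-sensitive.

```python
import re

MAX_CHARS = 500  # limit stated in the "text" parameter description

# Tag names copied verbatim from this page, spelling included.
SUPPORTED_TAGS = {
    "clear throat", "sigh", "sush", "cough", "groan",
    "sniff", "gasp", "chuckle", "laugh",
}

# Matches any bracketed span such as [chuckle]; nested brackets are not handled.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def check_text(text):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if not text.strip():
        problems.append("text is required")
    if len(text) > MAX_CHARS:
        problems.append(f"text is {len(text)} characters; the limit is {MAX_CHARS}")
    for tag in TAG_PATTERN.findall(text):
        # Exact, case-sensitive match: an assumption, since the page does not specify.
        if tag not in SUPPORTED_TAGS:
            problems.append(f"unsupported tag: [{tag}]")
    return problems
```

Because every bracketed span is treated as a tag, text that uses square brackets for other purposes will be flagged as well.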
Output Schema
Output
Example Execution Logs
Using random seed: 56273
Generating audio for text: 'Oh, that's hilarious! [chuckle] Um anyway, we do h...'
Using reference audio: /tmp/tmp9ynu_rvpvoice.wav
0%| | 0/1000 [00:00<?, ?it/s]
1%|▏ | 14/1000 [00:00<00:07, 137.10it/s]
3%|▎ | 33/1000 [00:00<00:05, 166.12it/s]
5%|▌ | 52/1000 [00:00<00:05, 175.62it/s]
7%|▋ | 71/1000 [00:00<00:05, 180.10it/s]
9%|▉ | 90/1000 [00:00<00:05, 181.86it/s]
11%|█ | 109/1000 [00:00<00:04, 182.87it/s]
13%|█▎ | 128/1000 [00:00<00:04, 183.79it/s]
15%|█▍ | 147/1000 [00:00<00:04, 182.63it/s]
17%|█▋ | 166/1000 [00:00<00:04, 183.39it/s]
18%|█▊ | 185/1000 [00:01<00:04, 183.15it/s]
20%|██ | 204/1000 [00:01<00:04, 182.90it/s]
22%|██▏ | 223/1000 [00:01<00:04, 183.64it/s]
24%|██▍ | 242/1000 [00:01<00:04, 184.06it/s]
26%|██▌ | 261/1000 [00:01<00:04, 184.44it/s]
28%|██▊ | 280/1000 [00:01<00:03, 183.55it/s]
30%|██▉ | 299/1000 [00:01<00:03, 183.05it/s]
32%|███▏ | 318/1000 [00:01<00:03, 182.83it/s]
34%|███▎ | 337/1000 [00:01<00:03, 182.20it/s]
36%|███▌ | 356/1000 [00:01<00:03, 181.23it/s]
36%|███▋ | 363/1000 [00:02<00:03, 180.72it/s]
S3 Token -> Mel Inference...
0%| | 0/2 [00:00<?, ?it/s]
100%|██████████| 2/2 [00:00<00:00, 25.75it/s]
Audio generation complete.
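The progress frames in this log can be read programmatically to estimate how long the sampling loop ran. The sketch below parses tqdm-style frames with a regular expression. `last_frame` is a hypothetical helper, and reading 1000 as a step budget that generation stopped short of (at step 363) is an interpretation of the log, not something the page states.

```python
import re

# Matches frames such as "363/1000 [00:02<00:03, 180.72it/s]".
# Frames with an unknown rate ("?it/s") are deliberately skipped.
FRAME = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<[^,]*, *([\d.]+)it/s\]")

# Two frames copied from the log above, plus the start of the next stage.
SAMPLE = (
    "36%|███▌ | 356/1000 [00:01<00:03, 181.23it/s] "
    "36%|███▋ | 363/1000 [00:02<00:03, 180.72it/s] "
    "S3 Token -> Mel Inference... 0%| | 0/2 [00:00<?, ?it/s]"
)


def last_frame(log, total):
    """Return (steps_done, elapsed_seconds, rate) for the last frame with this total."""
    frames = [m for m in FRAME.finditer(log) if int(m.group(2)) == total]
    if not frames:
        return None
    m = frames[-1]
    elapsed = int(m.group(3)) * 60 + int(m.group(4))
    return int(m.group(1)), elapsed, float(m.group(5))


done, elapsed, rate = last_frame(SAMPLE, total=1000)
# Steps divided by the reported rate gives an estimate of the loop duration in seconds.
estimated_seconds = done / rate
```

For the log above, the estimate comes out at roughly two seconds, which suggests the sampling loop accounts for most of the 3.00s prediction time.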
Version Details
- Version ID
- 95c87b883ff3e842a1643044dff67f9d204f70a80228f24ff64bffe4a4b917d4
- Version Created
- December 15, 2025