resemble-ai/chatterbox
About
Generate expressive, natural speech. Features unique emotion control, instant voice cloning from short audio, and built-in watermarking.

Example Output
"We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.
If you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms, ideal for production use in agents, applications, or interactive media."
All Input Parameters
{
  "seed": 0,
  "prompt": "We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.\n\nWhether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.\n\nIf you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms\u2014ideal for production use in agents, applications, or interactive media.",
  "cfg_weight": 0.5,
  "temperature": 0.8,
  "exaggeration": 0.5
}
Input Parameters
- seed: Seed (0 for random)
- prompt (required): Text to synthesize
- cfg_weight: CFG/Pace weight
- temperature: Temperature
- audio_prompt: Path to the reference audio file (optional)
- exaggeration: Exaggeration (neutral = 0.5; extreme values can be unstable)
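The parameters listed above can be assembled into a request payload along the following lines. This is a minimal sketch, not the model's official client code: the helper names `build_input` and `synthesize` are hypothetical, the default values are simply copied from the example input on this page (they may not be the schema defaults), and the call assumes the `replicate` Python package is installed with `REPLICATE_API_TOKEN` set in the environment.

```python
from typing import Any, Optional

# Model name and version ID as shown on this page.
MODEL = "resemble-ai/chatterbox"
VERSION = "1b8422bc49635c20d0a84e387ed20879c0dd09254ecdb4e75dc4bec10ff94e97"


def build_input(
    prompt: str,
    seed: int = 0,                # 0 means "pick a random seed"
    cfg_weight: float = 0.5,      # CFG/pace weight (value from the example input)
    temperature: float = 0.8,     # value from the example input
    exaggeration: float = 0.5,    # 0.5 is neutral; extreme values can be unstable
    audio_prompt: Optional[str] = None,  # optional reference audio
) -> dict[str, Any]:
    """Hypothetical helper: build the input dict from the documented parameter names."""
    if not prompt.strip():
        raise ValueError("prompt is required and must not be empty")
    payload: dict[str, Any] = {
        "seed": seed,
        "prompt": prompt,
        "cfg_weight": cfg_weight,
        "temperature": temperature,
        "exaggeration": exaggeration,
    }
    # Only include the optional reference audio when one is supplied.
    # Assumption: how the file is passed (URL or open file handle) depends on the client.
    if audio_prompt is not None:
        payload["audio_prompt"] = audio_prompt
    return payload


def synthesize(payload: dict[str, Any]) -> Any:
    """Hypothetical helper: send the payload to the hosted model."""
    # Imported lazily so build_input works without the client installed.
    import replicate

    return replicate.run(f"{MODEL}:{VERSION}", input=payload)


if __name__ == "__main__":
    print(build_input("Hello from Chatterbox."))
```

Keeping payload construction separate from the network call makes it possible to check the inputs locally before spending a prediction.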
Example Execution Logs
Using seed: 35127
Prompt: We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations. Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app. If you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms, ideal for production use in agents, applications, or interactive media.
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/scope.py:22: ExperimentalFeatureWarning: current_scope is an experimental internal function. It may change or be removed without warning.
  warnings.warn(
/root/.pyenv/versions/3.11.10/lib/python3.11/contextlib.py:105: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
Sampling:   0%| | 0/1000 [00:00<?, ?it/s]
Sampling:   1%| | 6/1000 [00:00<00:18, 55.18it/s]
...
Sampling:  50%| | 495/1000 [00:05<00:06, 84.11it/s]
...
Sampling:  98%| | 979/1000 [00:11<00:00, 85.66it/s]
Sampling:  98%| | 981/1000 [00:11<00:00, 84.85it/s]
Input character count: 809
Characters per second: 63.55
Total time: 12.73s
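The three summary figures at the end of the log are related: characters per second is the input character count divided by the total time. The following sketch parses the summary and rechecks that arithmetic. The function name `parse_summary` and the regular expressions are illustrative assumptions based only on the log text shown here, not part of the model's tooling.

```python
import re

# Summary fields copied from the end of the execution log above.
SUMMARY = "Input character count: 809 Characters per second: 63.55 Total time: 12.73s"


def parse_summary(text: str) -> dict[str, float]:
    """Hypothetical helper: extract the three summary numbers from the log text."""
    chars = int(re.search(r"Input character count:\s*(\d+)", text).group(1))
    cps = float(re.search(r"Characters per second:\s*([\d.]+)", text).group(1))
    total_s = float(re.search(r"Total time:\s*([\d.]+)s", text).group(1))
    return {"chars": chars, "chars_per_second": cps, "total_s": total_s}


stats = parse_summary(SUMMARY)

# Recompute throughput from the other two fields: 809 / 12.73.
recomputed = stats["chars"] / stats["total_s"]
print(f"{recomputed:.2f}")  # prints 63.55, matching the logged value
```

Note that the total time (12.73 s) is longer than the sampling phase alone (about 11 s in the progress bar), so the reported throughput includes overhead outside the sampling loop.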
Version Details
- Version ID: 1b8422bc49635c20d0a84e387ed20879c0dd09254ecdb4e75dc4bec10ff94e97
- Version Created: June 20, 2025