resemble-ai/chatterbox

⭐ Official ▢️ 125.3K runs πŸ“… Jun 2025 βš™οΈ Cog 0.15.5 βš–οΈ License
text-to-speech voice-cloning

About

Generate expressive, natural speech. Features unique emotion control, instant voice cloning from short audio, and built-in watermarking.

Example Output

Prompt:

"We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.

If you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media."

Output

Example output

Performance Metrics

12.77s Prediction Time
12.77s Total Time
All Input Parameters
{
  "seed": 0,
  "prompt": "We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.\n\nWhether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.\n\nIf you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.",
  "cfg_weight": 0.5,
  "temperature": 0.8,
  "exaggeration": 0.5
}
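The parameters above form the `input` object of a prediction. As a minimal sketch, they could be submitted with the Replicate Python client, assuming the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment. The `synthesize` helper and its injectable `run` argument are illustrative names, not part of the model or the client.

```python
from typing import Any, Callable, Optional

MODEL_REF = "resemble-ai/chatterbox"  # model name as shown on this page

# Defaults copied from the example input above.
DEFAULT_INPUT = {
    "seed": 0,
    "cfg_weight": 0.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
}


def synthesize(
    prompt: str,
    run: Optional[Callable[..., Any]] = None,
    **overrides: Any,
) -> Any:
    """Submit one prediction and return whatever the client returns.

    `run` is a hypothetical injection point so the function can be exercised
    without network access; by default it falls back to `replicate.run`.
    """
    if run is None:
        import replicate  # imported lazily; requires `pip install replicate`

        run = replicate.run
    # Start from the defaults, apply caller overrides, then set the prompt.
    payload = {**DEFAULT_INPUT, **overrides, "prompt": prompt}
    return run(MODEL_REF, input=payload)
```

A call such as `synthesize("Hello from Chatterbox.", exaggeration=0.7)` would run the model with a slightly more expressive delivery. Depending on the client version, the return value may be a URI string or a file-like wrapper around it.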
Input Parameters
seed Type: integer • Default: 0
Seed (0 for random)
prompt (required) Type: string
Text to synthesize
cfg_weight Type: number • Default: 0.5 • Range: 0.2 - 1
CFG/Pace weight
temperature Type: number • Default: 0.8 • Range: 0.05 - 5
Temperature
audio_prompt Type: string
Path to the reference audio file (Optional)
exaggeration Type: number • Default: 0.5 • Range: 0.25 - 2
Exaggeration (Neutral = 0.5, extreme values can be unstable)
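The documented types and ranges can be checked on the client before a prediction is submitted. The sketch below is one way to do that. It assumes the range bounds are inclusive and that an empty prompt should be rejected; the page states neither, and `check_input` is a hypothetical helper name.

```python
# Documented ranges from the parameter list above, as (low, high).
# Treating both bounds as inclusive is an assumption.
RANGES = {
    "cfg_weight": (0.2, 1.0),
    "temperature": (0.05, 5.0),
    "exaggeration": (0.25, 2.0),
}


def check_input(payload: dict) -> dict:
    """Return `payload` unchanged if it satisfies the documented constraints."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt is required and must be a non-empty string")

    seed = payload.get("seed", 0)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise ValueError("seed must be an integer")

    for name, (low, high) in RANGES.items():
        if name in payload and not (low <= payload[name] <= high):
            raise ValueError(f"{name}={payload[name]} is outside [{low}, {high}]")
    return payload
```

Failing early this way avoids paying for a prediction that the server would reject, for example one with `exaggeration` set to 3.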
Output Schema

Output

Type: string • Format: uri
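Because the output is a URI rather than raw audio, a caller usually downloads the file as a second step. The sketch below uses only the Python standard library. The helper names and the `output.wav` fallback are assumptions, since the page does not state the audio format.

```python
import os
import urllib.request
from urllib.parse import urlparse


def output_filename(uri: str, default: str = "output.wav") -> str:
    """Derive a local file name from the last path segment of the output URI."""
    name = os.path.basename(urlparse(uri).path)
    return name or default  # fall back when the path has no file segment


def save_output(uri: str, directory: str = ".") -> str:
    """Download the generated audio and return the local path."""
    path = os.path.join(directory, output_filename(uri))
    urllib.request.urlretrieve(uri, path)
    return path
```

Only `save_output` touches the network; `output_filename` is a pure function and can be tested in isolation.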

Example Execution Logs
Using seed: 35127
Prompt: We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.
If you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/scope.py:22: ExperimentalFeatureWarning: current_scope is an experimental internal function. It may change or be removed without warning.
warnings.warn(
/root/.pyenv/versions/3.11.10/lib/python3.11/contextlib.py:105: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
Sampling:   0%|          | 0/1000 [00:00<?, ?it/s]
Sampling:  98%|█████████▊| 981/1000 [00:11<00:00, 84.85it/s]
Input character count: 809
Characters per second: 63.55
Total time: 12.73s
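The throughput figure in the log is the input character count divided by the total time. The short check below reproduces it from the two logged values. The `estimate_seconds` helper is a hypothetical extrapolation that assumes run time scales linearly with prompt length, which a single run cannot confirm.

```python
def chars_per_second(char_count: int, total_seconds: float) -> float:
    """Throughput as reported in the log: characters divided by seconds."""
    return char_count / total_seconds


def estimate_seconds(char_count: int, rate: float) -> float:
    """Rough run-time estimate for another prompt (linear scaling assumed)."""
    return char_count / rate


# 809 characters and 12.73 s are the values from the log above.
rate = chars_per_second(809, 12.73)
print(f"{rate:.2f} characters per second")  # 63.55 characters per second
```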
Version Details
Version ID
1b8422bc49635c20d0a84e387ed20879c0dd09254ecdb4e75dc4bec10ff94e97
Version Created
June 20, 2025