resemble-ai/chatterbox

⭐ Official ▶️ 234.8K runs 📅 Jun 2025 ⚙️ Cog 0.15.5 ⚖️ License

audio-watermarking text-to-speech voice-cloning

About

Generate expressive, natural speech. Features unique emotion control, instant voice cloning from short audio, and built-in watermarking.

Example Output

Prompt:

We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.

If you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.

Output

Performance Metrics

12.77s Prediction Time

12.77s Total Time

All Input Parameters

{
  "seed": 0,
  "prompt": "We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.\n\nWhether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.\n\nIf you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.",
  "cfg_weight": 0.5,
  "temperature": 0.8,
  "exaggeration": 0.5
}
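The input object above can also be assembled and submitted from code. The following is a minimal sketch, assuming the `replicate` Python client is installed and a `REPLICATE_API_TOKEN` is set in the environment. `build_input` and `synthesize` are hypothetical helper names, not part of the model or the client, and the type returned by `replicate.run` may differ between client versions.

```python
def build_input(prompt, seed=0, cfg_weight=0.5, temperature=0.8,
                exaggeration=0.5, audio_prompt=None):
    """Assemble the input dict using the parameter names listed on this page."""
    if not prompt:
        raise ValueError("prompt is required")
    payload = {
        "seed": seed,  # 0 asks the model to pick a random seed
        "prompt": prompt,
        "cfg_weight": cfg_weight,
        "temperature": temperature,
        "exaggeration": exaggeration,
    }
    # audio_prompt is optional, so it is only sent when provided.
    if audio_prompt is not None:
        payload["audio_prompt"] = audio_prompt
    return payload


def synthesize(prompt, **kwargs):
    """Hypothetical wrapper: run the model and return whatever the client returns."""
    import replicate  # imported lazily so build_input works without the client

    return replicate.run("resemble-ai/chatterbox", input=build_input(prompt, **kwargs))
```

Calling `synthesize("Hello from Chatterbox.")` would start a prediction with the default values shown above; this requires network access and a valid token.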

Input Parameters

seed • Type: integer • Default: 0 • Seed (0 for random)
prompt (required) • Type: string • Text to synthesize
cfg_weight • Type: number • Default: 0.5 • Range: 0.2 - 1 • CFG/Pace weight
temperature • Type: number • Default: 0.8 • Range: 0.05 - 5 • Temperature
audio_prompt • Type: string • Path to the reference audio file (Optional)
exaggeration • Type: number • Default: 0.5 • Range: 0.25 - 2 • Exaggeration (Neutral = 0.5, extreme values can be unstable)
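The documented ranges can be checked on the client side before a request is sent. This is a minimal sketch: the bounds are copied from the parameter list above, while `RANGES` and `check_ranges` are hypothetical names introduced for illustration. Whether the service itself rejects or clamps out-of-range values is not stated on this page.

```python
# Bounds copied from the parameter list; inclusive on both ends (an assumption).
RANGES = {
    "cfg_weight": (0.2, 1.0),
    "temperature": (0.05, 5.0),
    "exaggeration": (0.25, 2.0),
}


def check_ranges(params):
    """Return a list of problems; the list is empty when all values are in range."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside {low} - {high}")
    return problems


# The defaults from this page pass the check.
print(check_ranges({"cfg_weight": 0.5, "temperature": 0.8, "exaggeration": 0.5}))
```

Parameters absent from the dict are skipped, so the same check works for partial inputs that rely on defaults.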

Output Schema

Output

Type: string • Format: uri
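Since the output is a URI string, the generated audio has to be fetched separately. The sketch below uses only the Python standard library. `filename_from_uri` and `save_output` are hypothetical helpers, and the fallback name `output.bin` is a placeholder, since the audio file format is not stated on this page.

```python
import shutil
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_uri(uri, fallback="output.bin"):
    """Use the last path segment of the URI as a local file name."""
    name = PurePosixPath(urlparse(uri).path).name
    return name or fallback


def save_output(uri, dest=None):
    """Download the file behind the output URI and return the local path."""
    dest = dest or filename_from_uri(uri)
    with urllib.request.urlopen(uri) as response, open(dest, "wb") as fh:
        shutil.copyfileobj(response, fh)  # stream to disk without loading it all
    return dest
```

If the client returns a file-like object rather than a plain string, converting it with `str(...)` before calling `save_output` may be needed; that depends on the client version.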

Example Execution Logs

Using seed: 35127
Prompt: We're excited to introduce Chatterbox, our first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. Try it now on our Hugging Face Gradio app.
If you like the model but need to scale or finetune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/scope.py:22: ExperimentalFeatureWarning: current_scope is an experimental internal function. It may change or be removed without warning.
warnings.warn(
/root/.pyenv/versions/3.11.10/lib/python3.11/contextlib.py:105: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
Sampling:   0%|          | 0/1000 [00:00<?, ?it/s]
Sampling:  98%|█████████▊| 981/1000 [00:11<00:00, 84.85it/s]
Input character count: 809
Characters per second: 63.55
Total time: 12.73s
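The logged throughput is consistent with dividing the input character count by the total time. A short check of that arithmetic, assuming this is how the figure is derived (the log does not state the formula); `chars_per_second` is a hypothetical helper:

```python
def chars_per_second(char_count, seconds):
    """Throughput as input characters divided by wall-clock seconds."""
    return char_count / seconds


rate = chars_per_second(809, 12.73)
print(f"{rate:.2f}")  # prints 63.55, matching the log line above
```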

Version Details

Version ID: 1b8422bc49635c20d0a84e387ed20879c0dd09254ecdb4e75dc4bec10ff94e97
Version Created: June 20, 2025
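To reproduce results against this exact build, a request can name the version explicitly. The sketch below composes a reference in the `owner/name:version` form that Replicate clients accept; `pinned_ref` is a hypothetical helper, and the constants are copied from this page.

```python
MODEL = "resemble-ai/chatterbox"
VERSION_ID = "1b8422bc49635c20d0a84e387ed20879c0dd09254ecdb4e75dc4bec10ff94e97"


def pinned_ref(model=MODEL, version=VERSION_ID):
    """Build an 'owner/name:version' reference for a specific model version."""
    if not model or not version:
        raise ValueError("both model and version are required")
    return f"{model}:{version}"


print(pinned_ref())
```

Omitting the version would target whatever version is current, which may change after June 20, 2025.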
