jaaari/zonos 🔢📝🖼️❓ → 🖼️

▶️ 4.8K runs 📅 Feb 2025 ⚙️ Cog 0.13.7
multilingual text-to-speech voice-cloning

About

Zonos-v0.1 by Zyphra: text-to-speech with voice cloning, support for 5 languages, and emotion control.

Example Output

Performance Metrics

54.64s Prediction Time
185.68s Total Time
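The gap between the two figures above can be worked out directly. The sketch below assumes that the total time covers everything from request to result, so the difference is time spent outside the prediction itself (for example queueing or model setup); the page does not break this down.

```python
# Figures copied from the Performance Metrics above.
PREDICTION_TIME_S = 54.64
TOTAL_TIME_S = 185.68

# Time not spent in the prediction itself (assumed: queueing, setup, etc.).
overhead_s = round(TOTAL_TIME_S - PREDICTION_TIME_S, 2)

# Share of the total wall-clock time that was actual prediction work.
prediction_share = PREDICTION_TIME_S / TOTAL_TIME_S

print(f"overhead: {overhead_s:.2f}s, prediction share: {prediction_share:.1%}")
```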
All Input Parameters
{
  "seed": 1,
  "text": "Hi! I'm Zonos, a text-to-speech model build by Zyphra. I can speak 5 languages and you can even control my emotions. You can also find me in Kuluko, an app that lets you create fully personalized audiobooks — from characters to storylines — all tailored to your preferences. Want to give it a go? Search for Kuluko on the Apple or Android app store and start crafting your own story today!",
  "audio": "https://replicate.delivery/pbxt/MUEtXI54W68rj2eUER8rrkaRNUPjtqZdVXN5hQnhmRVMBqwC/richard_sample.wav",
  "emotion": "",
  "language": "en-us",
  "model_type": "transformer",
  "speaking_rate": 15
}
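As an illustration, the inputs above could be sent from Python roughly as follows. This is a minimal sketch, not an official client example: it assumes the third-party `replicate` package is installed and a `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `synthesize` are hypothetical; the version ID is the one listed under Version Details.

```python
# Model reference in "owner/name:version" form, using the version ID from this page.
MODEL_VERSION = (
    "jaaari/zonos:"
    "79caaf88e47605d71197442eb35361be922488dfb2d55de8ae757cc73d6d2a15"
)


def build_input(text, audio=None, emotion="", language="en-us",
                model_type="transformer", speaking_rate=15, seed=None):
    """Assemble the input dict; optional fields are omitted when not given."""
    payload = {
        "text": text,
        "emotion": emotion,
        "language": language,
        "model_type": model_type,
        "speaking_rate": speaking_rate,
    }
    if audio is not None:
        payload["audio"] = audio  # URL of the reference clip for voice cloning
    if seed is not None:
        payload["seed"] = seed
    return payload


def synthesize(payload):
    """Run the model remotely. Imported lazily so build_input works offline."""
    import replicate  # third-party client, assumed installed
    return replicate.run(MODEL_VERSION, input=payload)


if __name__ == "__main__":
    result = synthesize(build_input("Hi! I'm Zonos.", seed=1))
    print(result)  # expected to reference the generated audio (see Output Schema)
```

Depending on the client version, the returned value may be a plain URL string or a file-like wrapper around it, so treat the final `print` as a starting point.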
Input Parameters
seed Type: integer
Seed for reproducibility (optional)
text (required) Type: string
Text to generate speech from
audio Type: string
Path to audio file for voice cloning (optional)
emotion Type: string Default: "" (empty string)
Optionally pass a comma-separated list of 8 floats for your desired emotion vector, in the order [Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral]. For example: '0.5,0.2,0.0,0.0,0.3,0.1,0.0,0.0'. If the value is empty or invalid, the model falls back to its built-in, near-neutral emotion.
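Because the order of the eight values matters, it can help to build the string from named weights. The helpers below are a hypothetical sketch (the names `emotion_string` and `parse_emotion` are not part of the model's API); they only encode the ordering and the "8 floats" rule stated above.

```python
# Order of the emotion vector as documented for the `emotion` parameter.
EMOTION_ORDER = ("happiness", "sadness", "disgust", "fear",
                 "surprise", "anger", "other", "neutral")


def emotion_string(**weights):
    """Build the comma-separated string; unspecified emotions default to 0.0."""
    unknown = set(weights) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return ",".join(str(float(weights.get(name, 0.0))) for name in EMOTION_ORDER)


def parse_emotion(value):
    """Return a list of 8 floats, or None if the string is empty or invalid."""
    try:
        parts = [float(p) for p in value.split(",")]
    except ValueError:
        return None
    return parts if len(parts) == len(EMOTION_ORDER) else None


print(emotion_string(happiness=0.5, sadness=0.2, surprise=0.3, anger=0.1))
```

With these weights the helper reproduces the example string given in the parameter description.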
language Default: en-us
Language code for speech generation
model_type Default: transformer
Model type to use
speaking_rate Type: number Default: 15 Range: 5 - 30
Speaking rate in phonemes per second. Default is 15.0. 10-12 is slow and clear, 15-17 is natural conversational, 20+ is fast. Values above 30 may produce artifacts.
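Since the rate is expressed in phonemes per second, a rough duration estimate follows from simple division. The sketch below is illustrative only: the helper names are hypothetical, the phoneme count must come from elsewhere, and the estimate ignores pauses and any behaviour specific to the model.

```python
# Documented range for speaking_rate.
MIN_RATE = 5.0
MAX_RATE = 30.0


def clamp_speaking_rate(rate):
    """Keep a requested rate inside the documented 5 - 30 range."""
    return max(MIN_RATE, min(MAX_RATE, float(rate)))


def estimate_seconds(phoneme_count, rate=15.0):
    """Rough speech duration: phonemes divided by phonemes per second."""
    return phoneme_count / clamp_speaking_rate(rate)


# A passage of 300 phonemes at the default rate of 15 phonemes per second.
print(estimate_seconds(300, 15))
```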
Output Schema

Output

Type: string Format: uri
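Since the output is a URI rather than raw audio, the caller has to fetch the file. A minimal sketch using only the standard library is shown below; it assumes the URI is already available as a plain string, and the helper names and the fallback filename are hypothetical.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_uri(uri, default="zonos_output"):
    """Use the last path segment of the URI as the local filename."""
    name = os.path.basename(urlparse(uri).path)
    return name or default


def save_output(uri, directory="."):
    """Download the generated audio and return the local path."""
    target = os.path.join(directory, filename_from_uri(uri))
    urlretrieve(uri, target)  # performs the network request
    return target
```

Calling `save_output(uri)` with the URI returned by a prediction writes the file into the current directory.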

Example Execution Logs
0%|          | 0/3060 [00:00<?, ?it/s]
  0%|          | 1/3060 [00:00<09:04,  5.61it/s]
  1%|          | 16/3060 [00:00<00:44, 68.83it/s]
  [... intermediate progress lines omitted ...]
 72%|███████▏  | 2194/3060 [00:15<00:06, 134.10it/s]
72%|███████▏  | 2198/3060 [00:15<00:06, 143.10it/s]
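The progress lines above follow a `done/total [elapsed<remaining, rate]` layout, which can be read programmatically. The parser below is a hypothetical sketch that only handles lines of this shape; it makes no claim about why the log ends before the final step count is reached.

```python
import re

# Matches e.g. "2198/3060 [00:15<" and captures done, total, minutes, seconds.
PROGRESS_RE = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<")


def parse_progress(line):
    """Extract step counts and elapsed seconds from one progress line."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    done, total, minutes, seconds = map(int, match.groups())
    return {
        "done": done,
        "total": total,
        "elapsed_s": minutes * 60 + seconds,
        "fraction": done / total,
    }


last_line = "72%|███████▏  | 2198/3060 [00:15<00:06, 143.10it/s]"
info = parse_progress(last_line)
print(info["done"], info["total"], info["elapsed_s"])
```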
Version Details
Version ID
79caaf88e47605d71197442eb35361be922488dfb2d55de8ae757cc73d6d2a15
Version Created
February 13, 2025