afiaka87/tortoise-tts

172.5K runs, Aug 2022, Cog 0.3.13
text-to-speech voice-cloning

About

Generate speech from text and clone voices from mp3 files. The underlying model is by James Betker, AKA "neonbjb".

Example Output

Performance Metrics

235.45s Prediction Time
459.51s Total Time
All Input Parameters
{
  "seed": 0,
  "text": "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.",
  "preset": "fast",
  "voice_a": "custom_voice",
  "voice_b": "disabled",
  "voice_c": "disabled",
  "custom_voice": "https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3"
}
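As a rough illustration, the example input above could be submitted with the Replicate Python client. This is a minimal sketch, not the page's official snippet: it assumes the `replicate` package is installed and that a `REPLICATE_API_TOKEN` environment variable is set. The helper names `model_ref` and `run_example` are hypothetical; the model name and version ID are the ones listed on this page.

```python
# Model name and version ID as listed on this page.
MODEL = "afiaka87/tortoise-tts"
VERSION = "e9658de4b325863c4fcdc12d94bb7c9b54cbfe351b7ca1b36860008172b91c71"

# The example input shown above, reproduced as a Python dict.
EXAMPLE_INPUT = {
    "seed": 0,
    "text": (
        "The expressiveness of autoregressive transformers is literally nuts! "
        "I absolutely adore them."
    ),
    "preset": "fast",
    "voice_a": "custom_voice",
    "voice_b": "disabled",
    "voice_c": "disabled",
    "custom_voice": (
        "https://replicate.delivery/mgxm/"
        "671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3"
    ),
}


def model_ref(model: str = MODEL, version: str = VERSION) -> str:
    """Build the "owner/name:version" reference string (hypothetical helper)."""
    return f"{model}:{version}"


def run_example(inputs: dict = EXAMPLE_INPUT):
    """Submit the prediction and return the client's result.

    The import is deferred so the rest of this sketch works without the
    package. Assumption: `pip install replicate` and REPLICATE_API_TOKEN set.
    """
    import replicate

    return replicate.run(model_ref(), input=inputs)
```

Calling `run_example()` blocks until the prediction finishes, which took several minutes in the example run recorded on this page.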
Input Parameters
seed Type: integer Default: 0
Random seed which can be used to reproduce results.
text Type: string Default: The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.
Text to speak.
preset Default: fast
Which voice preset to use. See the documentation for more information.
voice_a Default: random
Selects the voice to use for generation. Use `random` to select a random voice. Use `custom_voice` to use a custom voice.
voice_b Default: disabled
(Optional) Create a new voice by averaging the latents for `voice_a`, `voice_b` and `voice_c`. Use `disabled` to disable voice mixing.
voice_c Default: disabled
(Optional) Create a new voice by averaging the latents for `voice_a`, `voice_b` and `voice_c`. Use `disabled` to disable voice mixing.
cvvp_amount Type: number Default: 0 Range: 0 - 1
How much the CVVP model should influence the output. Increasing this can in some cases reduce the likelihood of multiple speakers. Defaults to 0 (disabled).
custom_voice Type: string
(Optional) Create a custom voice based on an mp3 file of a speaker. The audio should be at least 15 seconds long, contain only one speaker, and be in mp3 format. Overrides the `voice_a` input.
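The constraints in the parameter list can be checked locally before a prediction is submitted. The sketch below is an illustration only: `build_input` is a hypothetical helper, the defaults are copied from the list above, and setting `voice_a` to `custom_voice` whenever a custom voice is supplied simply mirrors the example input on this page. The 15-second and single-speaker requirements cannot be verified from a URL and are not checked here.

```python
from typing import Optional
from urllib.parse import urlparse


def build_input(
    text: str,
    voice_a: str = "random",
    voice_b: str = "disabled",
    voice_c: str = "disabled",
    preset: str = "fast",
    seed: int = 0,
    cvvp_amount: float = 0.0,
    custom_voice: Optional[str] = None,
) -> dict:
    """Assemble an input payload, enforcing the documented constraints."""
    if not text:
        raise ValueError("text must be a non-empty string")
    # Documented range for cvvp_amount is 0 - 1.
    if not 0.0 <= cvvp_amount <= 1.0:
        raise ValueError("cvvp_amount must be between 0 and 1")

    payload = {
        "seed": seed,
        "text": text,
        "preset": preset,
        "voice_a": voice_a,
        "voice_b": voice_b,
        "voice_c": voice_c,
        "cvvp_amount": cvvp_amount,
    }

    if custom_voice is not None:
        # The page asks for mp3 audio; only the file extension is checked.
        if not urlparse(custom_voice).path.lower().endswith(".mp3"):
            raise ValueError("custom_voice should point to an mp3 file")
        payload["custom_voice"] = custom_voice
        # custom_voice overrides voice_a, so make that explicit in the payload.
        payload["voice_a"] = "custom_voice"

    return payload
```

For example, `build_input("Hello there.", custom_voice="https://example.com/speaker.mp3")` returns a payload whose `voice_a` is `custom_voice`, while `build_input("Hello there.", cvvp_amount=1.5)` raises `ValueError`.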
Output Schema

Output

Type: string Format: uri
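Since the output is a URI rather than the audio itself, the file has to be fetched separately. The following is a minimal sketch using only the Python standard library, assuming the result arrives as a plain URI string as the schema states; `save_output` and the destination filename are hypothetical, and the audio format of the returned file is not stated on this page.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output(uri: str, dest) -> Path:
    """Stream the resource at `uri` into the local file `dest`."""
    dest = Path(dest)
    # Copy in chunks so the whole file is never held in memory at once.
    with urllib.request.urlopen(uri) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

A call such as `save_output(output_uri, "speech_output")` would then write the generated audio to disk, where `output_uri` is the string returned by the prediction.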

Example Execution Logs
Creating voice from /tmp/tmpn3ll0ogznixon.mp3
WARNING: Input file had loudness range of 10.4, which is larger than the loudness range target (7.0). Normalization will revert to dynamic mode. Choose a higher target loudness range if you want linear normalization.
WARNING: In dynamic mode, the sample rate will automatically be set to 192 kHz by the loudnorm filter. Specify -ar/--sample-rate to override it.
[wav @ 0x560e9a45a3c0] ignoring wrong sample_count 55165030
[wav @ 0x560e9a45a3c0] Estimating duration from bitrate, this may be inaccurate
Generating text using voices: ['custom_voice']
Generating autoregressive samples..

  0%|          | 0/6 [00:00<?, ?it/s]
 17%|█▋        | 1/6 [00:05<00:28,  5.80s/it]
 33%|███▎      | 2/6 [00:11<00:22,  5.57s/it]
 50%|█████     | 3/6 [00:16<00:16,  5.65s/it]
 67%|██████▋   | 4/6 [00:22<00:10,  5.43s/it]
 83%|████████▎ | 5/6 [00:27<00:05,  5.54s/it]
100%|██████████| 6/6 [00:33<00:00,  5.61s/it]
100%|██████████| 6/6 [00:33<00:00,  5.59s/it]
Computing best candidates using CLVP

  0%|          | 0/6 [00:00<?, ?it/s]
 17%|█▋        | 1/6 [00:00<00:01,  3.79it/s]
 33%|███▎      | 2/6 [00:01<00:02,  1.46it/s]
 50%|█████     | 3/6 [00:02<00:02,  1.22it/s]
 67%|██████▋   | 4/6 [00:03<00:01,  1.14it/s]
 83%|████████▎ | 5/6 [00:04<00:00,  1.09it/s]
100%|██████████| 6/6 [00:05<00:00,  1.06it/s]
100%|██████████| 6/6 [00:05<00:00,  1.16it/s]
Version Details
Version ID
e9658de4b325863c4fcdc12d94bb7c9b54cbfe351b7ca1b36860008172b91c71
Version Created
August 2, 2022