afiaka87/tortoise-tts

173.1K runs, Aug 2022, Cog 0.3.13

text-to-speech voice-cloning

About

Generates speech from text and clones voices from mp3 files. From James Betker, AKA "neonbjb".

Example Output

Performance Metrics

235.45s Prediction Time

459.51s Total Time
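
The gap between the two figures above can be worked out directly. The sketch below is only arithmetic on the reported numbers; reading the difference as queueing and start-up time is an assumption, since the page does not say what the total includes.

```python
# Figures copied from the performance metrics above.
prediction_time_s = 235.45
total_time_s = 459.51

# Time not spent in the prediction itself
# (assumed, not confirmed: queueing and model start-up).
overhead_s = round(total_time_s - prediction_time_s, 2)
overhead_share = overhead_s / total_time_s

print(f"overhead: {overhead_s:.2f}s ({overhead_share:.0%} of total)")
```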

All Input Parameters

{
  "seed": 0,
  "text": "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.",
  "preset": "fast",
  "voice_a": "custom_voice",
  "voice_b": "disabled",
  "voice_c": "disabled",
  "custom_voice": "https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3"
}
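
A payload like the one above could be assembled and submitted from Python roughly as follows. This is a minimal sketch, not the model author's code: `build_input` and `run_prediction` are hypothetical helper names, the call assumes the `replicate` Python client is installed with an API token configured in the environment, and the version string is the Version ID listed on this page.

```python
import json

MODEL = "afiaka87/tortoise-tts"
VERSION = "e9658de4b325863c4fcdc12d94bb7c9b54cbfe351b7ca1b36860008172b91c71"


def build_input(text, custom_voice_url=None, preset="fast", seed=0):
    """Return an input dict shaped like the example payload (hypothetical helper)."""
    payload = {
        "seed": seed,
        "text": text,
        "preset": preset,
        # custom_voice overrides voice_a, so select it explicitly when a URL is given.
        "voice_a": "custom_voice" if custom_voice_url else "random",
        "voice_b": "disabled",
        "voice_c": "disabled",
    }
    if custom_voice_url:
        payload["custom_voice"] = custom_voice_url
    return payload


def run_prediction(payload):
    """Submit the payload. Assumes the `replicate` package and an API token."""
    import replicate  # imported lazily so build_input works without the package

    return replicate.run(f"{MODEL}:{VERSION}", input=payload)


example = build_input(
    "The expressiveness of autoregressive transformers is literally nuts! "
    "I absolutely adore them.",
    custom_voice_url="https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3",
)
print(json.dumps(example, indent=2))
```

Only `build_input` runs here; `run_prediction(example)` would make a network call and, given the timings reported on this page, may take several minutes.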

Input Parameters

seed (Type: integer, Default: 0): Random seed, which can be used to reproduce results.
text (Type: string, Default: "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them."): Text to speak.
preset (Default: fast): Which voice preset to use. See the documentation for more information.
voice_a (Default: random): Selects the voice to use for generation. Use `random` to select a random voice. Use `custom_voice` to use a custom voice.
voice_b (Default: disabled): (Optional) Create a new voice by averaging the latents for `voice_a`, `voice_b` and `voice_c`. Use `disabled` to disable voice mixing.
voice_c (Default: disabled): (Optional) Create a new voice by averaging the latents for `voice_a`, `voice_b` and `voice_c`. Use `disabled` to disable voice mixing.
cvvp_amount (Type: number, Default: 0, Range: 0 - 1): How much the CVVP model should influence the output. Increasing this can in some cases reduce the likelihood of multiple speakers. Defaults to 0 (disabled).
custom_voice (Type: string): (Optional) Create a custom voice based on an mp3 file of a speaker. Audio should be at least 15 seconds long, contain only one speaker, and be in mp3 format. Overrides the `voice_a` input.
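
The voice-mixing parameters describe averaging the latents of up to three voices. The sketch below illustrates that idea as an element-wise mean over plain Python lists. It is an assumption about what "averaging" means here, not the model's actual implementation; `mix_voice_latents` is a hypothetical name, and the vectors are toy values rather than real model latents.

```python
def mix_voice_latents(latents_by_voice):
    """Average equal-length latent vectors, skipping voices set to 'disabled'."""
    active = [vec for name, vec in latents_by_voice if name != "disabled"]
    if not active:
        raise ValueError("at least one voice must be enabled")
    length = len(active[0])
    if any(len(vec) != length for vec in active):
        raise ValueError("latent vectors must have the same length")
    # zip(*active) walks the vectors position by position.
    return [sum(values) / len(active) for values in zip(*active)]


# Toy 3-dimensional latents for two enabled voices and one disabled slot.
mixed = mix_voice_latents([
    ("voice_a", [0.0, 2.0, 4.0]),
    ("voice_b", [2.0, 4.0, 6.0]),
    ("disabled", []),
])
print(mixed)
```

With only `voice_a` enabled, the function returns that voice's latents unchanged, which matches the documented behavior of `disabled` turning mixing off.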

Output Schema

Output

Type: string • Format: uri
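
Because the output is a URI string rather than raw audio, a caller would typically check it and pick a local filename before downloading. The following is a small sketch of that step using only the standard library; `output_filename` is a hypothetical helper, and the example URL is a placeholder rather than a real model output.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_filename(uri, default="output.bin"):
    """Check that uri is an HTTP(S) URI and return the last path segment."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an HTTP(S) URI: {uri!r}")
    name = PurePosixPath(parsed.path).name
    # Fall back to a default when the path has no final segment.
    return name or default


# Placeholder URL for illustration only.
name = output_filename("https://example.com/outputs/tortoise.mp3")
print(name)
```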

Example Execution Logs

Creating voice from /tmp/tmpn3ll0ogznixon.mp3
WARNING: Input file had loudness range of 10.4, which is larger than the loudness range target (7.0). Normalization will revert to dynamic mode. Choose a higher target loudness range if you want linear normalization.
WARNING: In dynamic mode, the sample rate will automatically be set to 192 kHz by the loudnorm filter. Specify -ar/--sample-rate to override it.
[wav @ 0x560e9a45a3c0] ignoring wrong sample_count 55165030
[wav @ 0x560e9a45a3c0] Estimating duration from bitrate, this may be inaccurate
Generating text using voices: ['custom_voice']
Generating autoregressive samples..

  0%|          | 0/6 [00:00<?, ?it/s]
 17%|█▋        | 1/6 [00:05<00:28,  5.80s/it]
 33%|███▎      | 2/6 [00:11<00:22,  5.57s/it]
 50%|█████     | 3/6 [00:16<00:16,  5.65s/it]
 67%|██████▋   | 4/6 [00:22<00:10,  5.43s/it]
 83%|████████▎ | 5/6 [00:27<00:05,  5.54s/it]
100%|██████████| 6/6 [00:33<00:00,  5.61s/it]
100%|██████████| 6/6 [00:33<00:00,  5.59s/it]
Computing best candidates using CLVP

  0%|          | 0/6 [00:00<?, ?it/s]
 17%|█▋        | 1/6 [00:00<00:01,  3.79it/s]
 33%|███▎      | 2/6 [00:01<00:02,  1.46it/s]
 50%|█████     | 3/6 [00:02<00:02,  1.22it/s]
 67%|██████▋   | 4/6 [00:03<00:01,  1.14it/s]
 83%|████████▎ | 5/6 [00:04<00:00,  1.09it/s]
100%|██████████| 6/6 [00:05<00:00,  1.06it/s]
100%|██████████| 6/6 [00:05<00:00,  1.16it/s]

Version Details

Version ID: e9658de4b325863c4fcdc12d94bb7c9b54cbfe351b7ca1b36860008172b91c71
Version Created: August 2, 2022
