lucataco/indextts-2

1.1K runs · Sep 2025 · Cog 0.16.2 · GitHub · Paper · License
emotion-control speech-style-transfer text-to-speech voice-cloning

About

Emotionally Expressive and Duration-Controlled Text-to-Speech

Example Output

[Example audio output; playable on the model page]

Performance Metrics

Prediction time: 5.31s
Total time: 5.32s
All Input Parameters
{
  "text": "You miss 100% of the shots you don't take... but why does it hurt so much? Start today with Index TTS 2, now on Replicate",
  "top_k": 30,
  "top_p": 0.8,
  "num_beams": 3,
  "temperature": 0.8,
  "emotion_scale": 1,
  "speaker_audio": "https://replicate.delivery/pbxt/Nitgz9LwQUvwL4jOcOpJSsKqaJ3jt8puvvWPkrnd46WLjw3H/emmy-woman-emotional.mp3",
  "length_penalty": 0,
  "max_mel_tokens": 1500,
  "randomize_emotion": false,
  "repetition_penalty": 10,
  "interval_silence_ms": 200,
  "max_text_tokens_per_segment": 120
}
Input Parameters
text (required) · Type: string
Text to synthesize.
top_k · Type: integer · Default: 30 · Range: 1-200
Top-k sampling for the GPT stage.
top_p · Type: number · Default: 0.8 · Range: 0-1
Top-p (nucleus) sampling for the GPT stage.
num_beams · Type: integer · Default: 3 · Range: 1-8
Beam width for the GPT stage.
temperature · Type: number · Default: 0.8 · Range: 0-2
Sampling temperature for the GPT stage.
emotion_text · Type: string
Text prompt used to auto-detect emotions via Qwen when provided.
emotion_audio · Type: string
Optional emotion reference audio. Defaults to the speaker audio when omitted.
emotion_scale · Type: number · Default: 1 · Range: 0-1
Blend ratio for the emotion reference when both speaker and emotion prompts are used.
speaker_audio (required) · Type: string
Reference audio for the target speaker (16-48 kHz WAV).
emotion_vector · Type: string
Optional comma-separated or JSON list of 8 emotion weights to bypass the classifier.
length_penalty · Type: number · Default: 0 · Range: 0-5
Beam-search length penalty.
max_mel_tokens · Type: integer · Default: 1500 · Range: 256-4096
Maximum mel tokens to generate per segment.
randomize_emotion · Type: boolean · Default: false
Pick emotion embeddings randomly instead of by nearest-neighbour selection when vectors are provided.
repetition_penalty · Type: number · Default: 10 · Range: 1-30
Penalty for repeated tokens.
interval_silence_ms · Type: integer · Default: 200 · Range: 0-2000
Silence inserted between long segments, in milliseconds.
max_text_tokens_per_segment · Type: integer · Default: 120 · Range: 32-300
Maximum BPE tokens per autoregressive segment.
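The parameters above can be bundled into an input payload for the Replicate Python client. A minimal sketch, assuming the `replicate` package and a `REPLICATE_API_TOKEN` in the environment; the local range check simply mirrors the documented bounds for illustration and is not part of the API, and the reference-audio URL is a placeholder:

```python
# Build and sanity-check an input payload for lucataco/indextts-2.
# The bounds below mirror the parameter table on this page.
RANGES = {
    "top_k": (1, 200),
    "top_p": (0.0, 1.0),
    "num_beams": (1, 8),
    "temperature": (0.0, 2.0),
    "emotion_scale": (0.0, 1.0),
    "repetition_penalty": (1.0, 30.0),
}

def make_input(text, speaker_audio, **overrides):
    """Return an input dict, raising if a value falls outside its documented range."""
    payload = {"text": text, "speaker_audio": speaker_audio, **overrides}
    for key, (lo, hi) in RANGES.items():
        if key in payload and not lo <= payload[key] <= hi:
            raise ValueError(f"{key}={payload[key]} outside [{lo}, {hi}]")
    return payload

payload = make_input(
    "Start today with Index TTS 2, now on Replicate",
    "https://example.com/reference-speaker.wav",  # placeholder reference clip
    top_k=30,
    temperature=0.8,
)
# To run for real (pip install replicate):
#   import replicate
#   url = replicate.run(
#       "lucataco/indextts-2:b219b0f22f95fd97cb2c8e3bbea6827a450a7fff05674c996d83171d70b3f685",
#       input=payload,
#   )
```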
Output Schema

Output

Type: string · Format: uri
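The output is a URI pointing at the generated WAV file. A minimal sketch of saving it locally with the standard library; the example URI and the `output_filename` helper are illustrative, not part of the model's API:

```python
import os
import urllib.parse
import urllib.request

def output_filename(uri: str) -> str:
    """Derive a local filename from the output URI's path."""
    return os.path.basename(urllib.parse.urlparse(uri).path) or "output.wav"

# Hypothetical URI of the shape this model returns:
uri = "https://replicate.delivery/some-id/output.wav"
name = output_filename(uri)
# urllib.request.urlretrieve(uri, name)  # uncomment to actually download
```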

Example Execution Logs
>> starting inference...
Use the specified emotion vector
  0%|          | 0/25 [00:00<?, ?it/s]
  4%|▍         | 1/25 [00:00<00:02,  9.23it/s]
 24%|██▍       | 6/25 [00:00<00:00, 30.92it/s]
 40%|████      | 10/25 [00:00<00:00, 33.68it/s]
 56%|█████▌    | 14/25 [00:00<00:00, 35.00it/s]
 72%|███████▏  | 18/25 [00:00<00:00, 35.67it/s]
 88%|████████▊ | 22/25 [00:00<00:00, 36.06it/s]
100%|██████████| 25/25 [00:00<00:00, 34.16it/s]
torch.Size([1, 209408])
>> gpt_gen_time: 4.11 seconds
>> gpt_forward_time: 0.01 seconds
>> s2mel_time: 0.74 seconds
>> bigvgan_time: 0.14 seconds
>> Total inference time: 5.19 seconds
>> Generated audio length: 9.50 seconds
>> RTF: 0.5461
>> wav file saved to: /tmp/tmpu5zjx8ba/output.wav
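The log's RTF (real-time factor) line is the ratio of inference time to generated audio length; values below 1 mean synthesis is faster than real time. A sketch of the arithmetic using the rounded figures from the log (the logged 0.5461 is computed from unrounded internal timings, so recomputing from the rounded numbers differs slightly):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means synthesis runs faster than real time."""
    return inference_seconds / audio_seconds

# Rounded figures from the log above: 5.19 s inference, 9.50 s of audio.
rtf = real_time_factor(5.19, 9.50)
print(f"RTF: {rtf:.4f}")  # close to the logged 0.5461
```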
Version Details
Version ID
b219b0f22f95fd97cb2c8e3bbea6827a450a7fff05674c996d83171d70b3f685
Version Created
September 15, 2025
Run on Replicate →