lucataco/indextts-2
About
Emotionally Expressive and Duration-Controlled Text-to-Speech

Example Output
Output
Performance Metrics
- Prediction Time: 5.31s
- Total Time: 5.32s
All Input Parameters
{
  "text": "You miss 100% of the shots you don't take... but why does it hurt so much? Start today with Index TTS 2, now on Replicate",
  "top_k": 30,
  "top_p": 0.8,
  "num_beams": 3,
  "temperature": 0.8,
  "emotion_scale": 1,
  "speaker_audio": "https://replicate.delivery/pbxt/Nitgz9LwQUvwL4jOcOpJSsKqaJ3jt8puvvWPkrnd46WLjw3H/emmy-woman-emotional.mp3",
  "length_penalty": 0,
  "max_mel_tokens": 1500,
  "randomize_emotion": false,
  "repetition_penalty": 10,
  "interval_silence_ms": 200,
  "max_text_tokens_per_segment": 120
}
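As a rough illustration of how these parameters might be sent from Python, the sketch below assembles the payload and hands it to the `replicate` client's `replicate.run` call. The helper names `build_input` and `run_prediction` are hypothetical. The settings are copied from the example above and are not confirmed to be the model's defaults. The network call is wrapped in a function so nothing is sent unless you invoke it, and it assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set.

```python
# Model reference assembled from the page title and the Version ID on this page.
MODEL_VERSION = (
    "lucataco/indextts-2:"
    "b219b0f22f95fd97cb2c8e3bbea6827a450a7fff05674c996d83171d70b3f685"
)

# Values copied from the example run; not necessarily the model's defaults.
EXAMPLE_SETTINGS = {
    "top_k": 30,
    "top_p": 0.8,
    "num_beams": 3,
    "temperature": 0.8,
    "emotion_scale": 1,
    "length_penalty": 0,
    "max_mel_tokens": 1500,
    "randomize_emotion": False,
    "repetition_penalty": 10,
    "interval_silence_ms": 200,
    "max_text_tokens_per_segment": 120,
}

# Optional inputs listed on this page that the example run did not set.
OPTIONAL_KEYS = {"emotion_text", "emotion_audio", "emotion_vector"}


def build_input(text, speaker_audio, **overrides):
    """Return an input payload; `text` and `speaker_audio` are required."""
    unknown = set(overrides) - set(EXAMPLE_SETTINGS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown input parameters: {sorted(unknown)}")
    payload = {"text": text, "speaker_audio": speaker_audio}
    payload.update(EXAMPLE_SETTINGS)
    payload.update(overrides)
    return payload


def run_prediction(payload):
    """Send the payload to Replicate (needs network access and an API token)."""
    import replicate  # imported lazily so the rest of the sketch runs offline

    return replicate.run(MODEL_VERSION, input=payload)


payload = build_input(
    "Start today with Index TTS 2, now on Replicate",
    "https://replicate.delivery/pbxt/Nitgz9LwQUvwL4jOcOpJSsKqaJ3jt8puvvWPkrnd46WLjw3H/emmy-woman-emotional.mp3",
    temperature=0.7,
)
print(sorted(payload))
```

Calling `run_prediction(payload)` would then start a prediction; the shape of the returned value depends on the client version, so it is left unhandled here.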
Input Parameters
- text (required): Text to synthesize.
- top_k: Top-k sampling for GPT stage.
- top_p: Top-p nucleus sampling for GPT stage.
- num_beams: Beam width for GPT stage.
- temperature: Sampling temperature for GPT stage.
- emotion_text: Text prompt used to auto-detect emotions via Qwen when provided.
- emotion_audio: Optional emotion reference audio. Defaults to speaker audio when omitted.
- emotion_scale: Blend ratio for the emotion reference when both speaker and emotion prompts are used.
- speaker_audio (required): Reference audio for the target speaker (16-48 kHz WAV).
- emotion_vector: Optional comma-separated or JSON list of 8 emotion weights to bypass the classifier.
- length_penalty: Beam search length penalty.
- max_mel_tokens: Maximum mel tokens to generate per segment.
- randomize_emotion: Pick emotion embeddings randomly instead of nearest-neighbour selection when vectors are provided.
- repetition_penalty: Penalty for repeated tokens.
- interval_silence_ms: Silence inserted between long segments in milliseconds.
- max_text_tokens_per_segment: Maximum BPE tokens per autoregressive segment.
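The `emotion_vector` input accepts either a comma-separated string or a JSON list of 8 weights. The sketch below shows one way a caller might check such a value before submitting it. The function name `parse_emotion_vector` is hypothetical and this is not the model's own parser. The page does not say which emotion each of the 8 positions stands for, so the code checks only the count and that every entry is numeric.

```python
import json

EXPECTED_WEIGHTS = 8  # the parameter description asks for 8 emotion weights


def parse_emotion_vector(raw: str) -> list[float]:
    """Parse a comma-separated or JSON-list string into 8 float weights."""
    raw = raw.strip()
    if raw.startswith("["):
        values = json.loads(raw)  # JSON list form, e.g. "[0, 0, 0.6, ...]"
    else:
        values = [part.strip() for part in raw.split(",")]  # "0, 0, 0.6, ..."
    weights = [float(value) for value in values]
    if len(weights) != EXPECTED_WEIGHTS:
        raise ValueError(
            f"expected {EXPECTED_WEIGHTS} weights, got {len(weights)}"
        )
    return weights


print(parse_emotion_vector("0, 0, 0.6, 0, 0, 0, 0, 0.4"))
print(parse_emotion_vector("[0, 0, 0.6, 0, 0, 0, 0, 0.4]"))
```

Both calls should yield the same list, so either form can be passed through unchanged once it validates.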
Output Schema
Output
Example Execution Logs
>> starting inference...
Use the specified emotion vector
0%| | 0/25 [00:00<?, ?it/s]
4%|▍ | 1/25 [00:00<00:02, 9.23it/s]
24%|██▍ | 6/25 [00:00<00:00, 30.92it/s]
40%|████ | 10/25 [00:00<00:00, 33.68it/s]
56%|█████▌ | 14/25 [00:00<00:00, 35.00it/s]
72%|███████▏ | 18/25 [00:00<00:00, 35.67it/s]
88%|████████▊ | 22/25 [00:00<00:00, 36.06it/s]
100%|██████████| 25/25 [00:00<00:00, 34.16it/s]
torch.Size([1, 209408])
>> gpt_gen_time: 4.11 seconds
>> gpt_forward_time: 0.01 seconds
>> s2mel_time: 0.74 seconds
>> bigvgan_time: 0.14 seconds
>> Total inference time: 5.19 seconds
>> Generated audio length: 9.50 seconds
>> RTF: 0.5461
>> wav file saved to: /tmp/tmpu5zjx8ba/output.wav
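The timing lines in the log can be read back programmatically. The sketch below assumes that RTF means real-time factor, that is, total inference time divided by generated audio length; the logged figures are consistent with that reading. The `parse_timings` helper and the regular expression are illustrative and are not part of the model's code. Because the logged times are rounded to two decimals, the recomputed ratio may differ from the logged RTF in the last digits.

```python
import re

# Timing lines copied from the example execution log above.
LOG = """\
>> gpt_gen_time: 4.11 seconds
>> gpt_forward_time: 0.01 seconds
>> s2mel_time: 0.74 seconds
>> bigvgan_time: 0.14 seconds
>> Total inference time: 5.19 seconds
>> Generated audio length: 9.50 seconds
>> RTF: 0.5461
"""

# Matches lines such as ">> name: 1.23 seconds" or ">> name: 1.23".
PATTERN = re.compile(
    r"^>> (?P<name>[^:\n]+): (?P<value>\d+(?:\.\d+)?)(?: seconds)?$",
    re.MULTILINE,
)


def parse_timings(log: str) -> dict[str, float]:
    """Return a mapping from metric name to its numeric value."""
    return {m["name"]: float(m["value"]) for m in PATTERN.finditer(log)}


timings = parse_timings(LOG)

# Real-time factor: values below 1 mean synthesis ran faster than playback.
rtf = timings["Total inference time"] / timings["Generated audio length"]

# Sum of the per-stage timers (names ending in "_time").
stage_sum = sum(v for name, v in timings.items() if name.endswith("_time"))

print(f"RTF from rounded log values: {rtf:.4f} (logged: {timings['RTF']})")
print(f"Sum of per-stage timers: {stage_sum:.2f} s")
```

The per-stage timers add up to slightly less than the logged total, which suggests the total also covers work that is not timed separately.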
Version Details
- Version ID: b219b0f22f95fd97cb2c8e3bbea6827a450a7fff05674c996d83171d70b3f685
- Version Created: September 15, 2025