minimax/speech-02-hd

Official · 861.1K runs · May 2025 · Cog 0.14.7
multilingual multilingual-tts text-to-speech voice-cloning

About

A Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. It is optimized for high-fidelity applications such as voiceovers and audiobooks.

Example Output

Performance Metrics

2.38s Prediction Time
2.39s Total Time
All Input Parameters
{
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "happy",
  "voice_id": "Friendly_Person",
  "sample_rate": 32000,
  "language_boost": "English",
  "english_normalization": true
}
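
The example input above can be reproduced from code. The following is a minimal sketch, assuming the third-party `replicate` Python client is installed and a `REPLICATE_API_TOKEN` is set in the environment; `build_input` and `synthesize` are hypothetical helper names, and the defaults in `build_input` mirror the example payload rather than the model's own defaults.

```python
from typing import Any, Dict

MODEL = "minimax/speech-02-hd"


def build_input(text: str, **overrides: Any) -> Dict[str, Any]:
    """Return a payload mirroring the example input, with optional overrides."""
    payload: Dict[str, Any] = {
        "text": text,
        "pitch": 0,
        "speed": 1,
        "volume": 1,
        "bitrate": 128000,
        "channel": "mono",
        "emotion": "happy",
        "voice_id": "Friendly_Person",
        "sample_rate": 32000,
        "language_boost": "English",
        "english_normalization": True,
    }
    payload.update(overrides)
    return payload


def synthesize(text: str, **overrides: Any) -> str:
    """Run the model and return its output as a string (documented as a URI)."""
    # Imported lazily so build_input() can be used without the client installed.
    import replicate

    output = replicate.run(MODEL, input=build_input(text, **overrides))
    # Assumption: converting the client's return value to str yields the URI.
    return str(output)
```

Calling `synthesize("Hello there", voice_id="Wise_Woman")` would send a request and incur usage; `build_input` alone makes no network calls.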
Input Parameters
text (required) Type: string
Text to convert to speech. Each character counts as 1 token. Maximum 5000 characters. Insert <#x#> between words to add a pause of x seconds (0.01-99.99).
pitch Type: integer, Default: 0, Range: -12 to 12
Speech pitch
speed Type: number, Default: 1, Range: 0.5 to 2
Speech speed
volume Type: number, Default: 1, Range: 0 to 10
Speech volume
bitrate Default: 128000
Bitrate for the generated speech
channel Default: mono
Number of audio channels
emotion Default: auto
Speech emotion
voice_id Type: string, Default: Wise_Woman
Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl
sample_rate Default: 32000
Sample rate for the generated speech
language_boost Default: None
Enhance recognition of specific languages and dialects
english_normalization Type: boolean, Default: false
Enable English text normalization for better number reading (slightly increases latency)
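
The documented limits (5000 characters of text, pitch -12 to 12, speed 0.5 to 2, volume 0 to 10, pauses of 0.01 to 99.99 seconds) can be checked locally before a request is sent. This is a rough sketch: `pause` and `validate_input` are hypothetical helpers, the two-decimal formatting of the pause marker is an assumption, and whether pause markers count toward the character limit is not stated in the parameter description.

```python
from typing import Any, Dict

MAX_CHARS = 5000


def pause(seconds: float) -> str:
    """Return a pause marker such as '<#0.50#>' for a duration in seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Assumption: two decimal places are accepted inside the marker.
    return f"<#{seconds:.2f}#>"


def _check_range(name: str, value: Any, low: float, high: float) -> None:
    """Raise ValueError unless value is a real number within [low, high]."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{name} must be a number")
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}")


def validate_input(payload: Dict[str, Any]) -> None:
    """Raise ValueError if the payload violates a documented limit."""
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("text is required")
    # Assumption: the limit is measured in Python characters, markers included.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")

    pitch = payload.get("pitch", 0)
    if isinstance(pitch, bool) or not isinstance(pitch, int):
        raise ValueError("pitch must be an integer")
    _check_range("pitch", pitch, -12, 12)
    _check_range("speed", payload.get("speed", 1), 0.5, 2)
    _check_range("volume", payload.get("volume", 1), 0, 10)
```

For example, `"Hello" + pause(0.5) + "world"` produces text with a half-second pause marker between the two words.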
Output Schema

Output

Type: string, Format: uri
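
Because the output is a URI string rather than raw audio, the audio has to be fetched separately. One possible way to save it with only the standard library is sketched below; `save_output` is a hypothetical helper, and the audio container format is not stated in the schema, so the destination file name is left to the caller.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output(uri: str, dest: str) -> Path:
    """Stream the resource at `uri` to `dest` and return the path written."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the whole file is never held in memory at once.
    with urllib.request.urlopen(uri) as response, path.open("wb") as fh:
        shutil.copyfileobj(response, fh)
    return path
```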

Example Execution Logs
Generating speech with model speech-02-hd
Generated speech in 2.38sec
Each character is 1 token
Tokens: 387
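
The generation time and token count can be pulled out of logs shaped like the example above. This is a small sketch; `parse_logs` is a hypothetical helper, and the log wording is taken from this one example and may differ between versions.

```python
import re
from typing import Dict, Optional


def parse_logs(logs: str) -> Dict[str, Optional[float]]:
    """Extract generation time (seconds) and token count from log text."""
    seconds = re.search(r"Generated speech in ([\d.]+)\s*sec", logs)
    tokens = re.search(r"Tokens:\s*(\d+)", logs)
    return {
        "seconds": float(seconds.group(1)) if seconds else None,
        "tokens": int(tokens.group(1)) if tokens else None,
    }
```

Missing fields come back as `None` instead of raising, so the function can be applied to partial logs.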
Version Details
Version ID
29657f664032844b8f800486164cf26acb2507288e348133e78ae871a43211d0
Version Created
May 6, 2025