minimax/speech-02-hd
About
A Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. It is optimized for high-fidelity applications such as voiceovers and audiobooks.
Performance Metrics
- Prediction Time: 2.38s
- Total Time: 2.39s
All Input Parameters
{
"text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
"pitch": 0,
"speed": 1,
"volume": 1,
"bitrate": 128000,
"channel": "mono",
"emotion": "happy",
"voice_id": "Friendly_Person",
"sample_rate": 32000,
"language_boost": "English",
"english_normalization": true
}
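The parameters above can be passed to the model from Python. The following is a minimal sketch rather than the page's official snippet: `synthesize` and `fake_run` are hypothetical names introduced here, and the `run` argument is assumed to be something like `replicate.run` from Replicate's Python client (which needs the `replicate` package and an API token). Taking `run` as a parameter lets the example execute offline with a stand-in.

```python
MODEL = "minimax/speech-02-hd"

# Option values copied from the example input shown above.
EXAMPLE_OPTIONS = {
    "pitch": 0,
    "speed": 1,
    "volume": 1,
    "bitrate": 128000,
    "channel": "mono",
    "emotion": "happy",
    "voice_id": "Friendly_Person",
    "sample_rate": 32000,
    "language_boost": "English",
    "english_normalization": True,
}


def synthesize(text, run, **overrides):
    """Hypothetical helper: merge overrides into the example options and call `run`."""
    payload = {"text": text, **EXAMPLE_OPTIONS, **overrides}
    return run(MODEL, input=payload)


def fake_run(ref, input):
    """Offline stand-in for a real client call; it only echoes its arguments."""
    return {"ref": ref, "input": input}


result = synthesize("Hello from speech-02-hd.", fake_run, emotion="auto")
print(result["input"]["emotion"])
```

With the real client installed, the same helper would presumably be called as `synthesize("Hello", replicate.run)`. What that call returns (a URL, a file-like object, or something else) depends on the client version and is not specified on this page.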
Input Parameters
- text (required): Text to narrate (max 10,000 characters). Insert a pause with a marker such as <#0.5#>, where the number is the pause length in seconds.
- pitch: Semitone offset applied to the voice (−12 to +12).
- speed: Speech speed multiplier (0.5–2.0). Lower is slower, higher is faster.
- volume: Relative loudness. 1.0 is the default MiniMax gain. Range 0–10.
- bitrate: MP3 bitrate in bits per second. Only used when audio_format is mp3.
- channel: mono for 1 channel (default), stereo for 2 channels.
- emotion: Desired delivery style. Use auto to let MiniMax choose, or pick a specific emotion.
- voice_id: Voice to synthesize. Pick any MiniMax system voice or a voice_id returned by https://replicate.com/minimax/voice-cloning.
- sample_rate: Audio sample rate in Hz.
- audio_format: File format for the generated audio. Choose mp3 for general use, wav or flac for lossless, or pcm for raw bytes.
- language_boost: Optional language hint. Choose Automatic to let MiniMax detect the language, or pick a specific locale.
- subtitle_enable: Return MiniMax subtitle metadata with sentence timestamps (non-streaming only).
- english_normalization: Improve number and date reading for English text (adds a small amount of latency).
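The documented ranges can be checked on the client side before a request is sent. The sketch below is illustrative only: `pause` and `validate_input` are hypothetical helpers, not part of any MiniMax or Replicate library. It assumes the listed ranges are inclusive, that the 10,000-character limit counts the whole string including pause markers, and that a pause length must be positive. None of these is confirmed by the page.

```python
MAX_TEXT_CHARS = 10_000
ALLOWED_CHANNELS = {"mono", "stereo"}
ALLOWED_FORMATS = {"mp3", "wav", "flac", "pcm"}
# Numeric ranges taken from the parameter descriptions (assumed inclusive).
RANGES = {"pitch": (-12, 12), "speed": (0.5, 2.0), "volume": (0, 10)}


def pause(seconds):
    """Build a pause marker such as <#0.5#> for embedding in the text."""
    if seconds <= 0:  # assumption: only positive pause lengths make sense
        raise ValueError("pause length must be positive")
    return f"<#{seconds}#>"


def validate_input(params):
    """Return a list of problems found in a parameter dict (empty if none)."""
    problems = []
    text = params.get("text")
    if not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(text)} characters (max {MAX_TEXT_CHARS})")
    # Only parameters that are actually present are range-checked.
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    if "channel" in params and params["channel"] not in ALLOWED_CHANNELS:
        problems.append("channel must be mono or stereo")
    if "audio_format" in params and params["audio_format"] not in ALLOWED_FORMATS:
        problems.append("audio_format must be mp3, wav, flac, or pcm")
    return problems


text = "First sentence." + pause(0.5) + "Second sentence."
print(text)
print(validate_input({"text": text, "pitch": 14, "channel": "mono"}))
```

Returning a list of problems instead of raising on the first one lets a caller report every out-of-range value at once. The service itself remains the authority on what it accepts.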
Output Schema
Output
Example Execution Logs
Generating speech with model speech-02-hd
Generated speech in 2.38sec
Each character is 1 token
Tokens: 387
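Because the log states that each character is one token, the token count for a request can be estimated locally from the text length. The snippet below is a rough sketch: `estimate_tokens` is a hypothetical helper, and it assumes the service counts characters the same way Python's `len` does (one per Unicode code point, with each newline counted as one character), which the page does not confirm for non-ASCII text.

```python
def estimate_tokens(text):
    """Estimate tokens as one per character, following the log line above."""
    return len(text)


sample = "Hello, world.\n\nSecond paragraph."
print(f"Tokens: {estimate_tokens(sample)}")
```

Counted this way, with each `\n` escape in the JSON treated as a single character, the example input text comes to 387 characters, which agrees with the `Tokens: 387` line in the log.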
Version Details
- Version ID: fdd081f807e655246ef42adbcb3ee9334e7fdc710428684771f90d69992cabb3
- Version Created: November 7, 2025