minimax/speech-02-turbo

Official · 10.1M runs · May 2025 · Cog 0.16.8

multilingual multilingual-tts real-time real-time-tts text-to-speech voice-cloning

About

A Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. It is designed for real-time applications that need low latency.

Example Output

Performance Metrics

Prediction Time: 2.37s

Total Time: 2.38s

All Input Parameters

{
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "angry",
  "voice_id": "Deep_Voice_Man",
  "sample_rate": 32000,
  "language_boost": "English",
  "english_normalization": true
}
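
As a rough sketch of how the input above could be sent from Python, the snippet below reuses the example settings and swaps in a caller-supplied text. It assumes the official `replicate` client package is installed and that `REPLICATE_API_TOKEN` is set in the environment. `build_input` and `run_model` are hypothetical helper names, not part of this model's API, and the type of the value returned by `replicate.run` may differ between client versions.

```python
MODEL = "minimax/speech-02-turbo"

# Settings copied from the example input on this page (everything except "text").
EXAMPLE_SETTINGS = {
    "pitch": 0,
    "speed": 1,
    "volume": 1,
    "bitrate": 128000,
    "channel": "mono",
    "emotion": "angry",
    "voice_id": "Deep_Voice_Man",
    "sample_rate": 32000,
    "language_boost": "English",
    "english_normalization": True,
}


def build_input(text: str, **overrides) -> dict:
    """Combine the required text with the example settings and any overrides."""
    return {**EXAMPLE_SETTINGS, **overrides, "text": text}


def run_model(model_input: dict):
    """Run one prediction (hypothetical helper; needs network access and an API token)."""
    import replicate  # imported lazily so this module loads without the client installed

    return replicate.run(MODEL, input=model_input)
```

A call such as `run_model(build_input("Hello there.", emotion="auto"))` would then submit a single prediction.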

Input Parameters

text (required; type: string): Text to narrate (max 10,000 characters). Use markers like <#0.5#> to insert pauses in seconds.
pitch (type: integer; default: 0; range: -12 to 12): Semitone offset applied to the voice (−12 to +12).
speed (type: number; default: 1; range: 0.5 to 2): Speech speed multiplier (0.5–2.0). Lower is slower, higher is faster.
volume (type: number; default: 1; range: 0 to 10): Relative loudness. 1.0 is default MiniMax gain. Range 0–10.
bitrate (default: 128000): MP3 bitrate in bits per second. Only used when audio_format is mp3.
channel (default: mono): mono for 1 channel (default), stereo for 2 channels.
emotion (default: auto): Desired delivery style. Use auto to let MiniMax choose, or pick a specific emotion.
voice_id (type: string; default: Wise_Woman): Voice to synthesize. Pick any MiniMax system voice or a voice_id returned by https://replicate.com/minimax/voice-cloning.
sample_rate (default: 32000): Audio sample rate in Hz.
audio_format (default: mp3): File format for the generated audio. Choose mp3 for general use, wav/flac for lossless, or pcm for raw bytes.
language_boost (default: None): Optional language hint. Choose Automatic to let MiniMax detect the language, or pick a specific locale.
subtitle_enable (type: boolean; default: false): Return MiniMax subtitle metadata with sentence timestamps (non-streaming only).
english_normalization (type: boolean; default: false): Improve number/date reading for English text (adds a small amount of latency).
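
The limits in the parameter list can be checked locally before a request is sent. The sketch below is one possible way to do that. It treats the documented ranges as inclusive, which is an assumption, and all function names are hypothetical. It also shows how the `<#0.5#>` style pause markers might be built.

```python
MAX_TEXT_CHARS = 10_000

# (min, max) bounds copied from the parameter list; assumed to be inclusive.
NUMERIC_RANGES = {
    "pitch": (-12, 12),
    "speed": (0.5, 2.0),
    "volume": (0, 10),
}


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.5#>."""
    return f"<#{seconds:g}#>"


def join_with_pauses(parts: list[str], seconds: float = 0.5) -> str:
    """Join text fragments with a pause marker between each pair."""
    return pause(seconds).join(parts)


def validate_input(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the documented limits are met."""
    problems = []
    text = params.get("text")
    if not isinstance(text, str):
        problems.append("text is required and must be a string")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}")
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name}={params[name]} is outside {low}..{high}")
    if "pitch" in params and not isinstance(params["pitch"], int):
        problems.append("pitch must be an integer")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue at once.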

Output Schema

Output

Type: string • Format: uri
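
Since the output is a URI string, the generated audio still has to be fetched before it can be played or stored. A minimal sketch using only the standard library follows. `save_audio` is a hypothetical helper name, and the code assumes the URI can be fetched without extra authentication.

```python
import shutil
import urllib.request
from pathlib import Path


def save_audio(uri: str, dest: Path, timeout: float = 30.0) -> int:
    """Stream the resource at `uri` into `dest` and return the number of bytes written."""
    with urllib.request.urlopen(uri, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest.stat().st_size
```

The file extension of `dest` should match the `audio_format` that was requested, for example `speech.mp3` for the default format.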

Example Execution Logs

Generating speech with model speech-02-turbo
Generated speech in 2.35sec
Each character is 1 token
Tokens: 387
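
The token figure in the log can be reproduced from the example input, since each character counts as one token. The sketch below uses Python's `len`, which counts Unicode code points. That matches the log for this ASCII text, but how the service counts non-ASCII characters is not stated here, so treat it as an estimate for other inputs. `count_tokens` is a hypothetical helper name.

```python
# The "text" value from the example input on this page.
EXAMPLE_TEXT = (
    "Speech-02-series is a Text-to-Audio and voice cloning technology that "
    "offers voice synthesis, emotional expression, and multilingual "
    "capabilities.\n\n"
    "The HD version is optimized for high-fidelity applications like "
    "voiceovers and audiobooks. While the turbo one is designed for "
    "real-time applications with low latency.\n\n"
    "When using this model on Replicate, each character represents 1 token."
)


def count_tokens(text: str) -> int:
    """Estimate billed tokens as one token per character."""
    return len(text)


print(count_tokens(EXAMPLE_TEXT))  # 387, matching "Tokens: 387" in the log
```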

Version Details

Version ID: e2e8812b45eefa93b20990418480fe628ddce470f9b72909a175d65e288ff3d5
Version Created: November 7, 2025
