zsxkib/dia 🔢📝🖼️🔊 → 🖼️

▶️ 12.8K runs 📅 Apr 2025 ⚙️ Cog 0.15.10

dialogue multi-speaker-tts text-to-speech voice-cloning

About

Dia 1.6B by Nari Labs generates realistic dialogue audio from text, including non-verbal cues and voice cloning.

Example Output


Performance Metrics

24.03s Prediction Time

90.39s Total Time

All Input Parameters

{
  "text": "[S1] It's on Replicate!!! Oh fire! Oh my goodness! What's the procedure? What to we do people? The Dia text-to-speech model just dropped on Replicate!!\n[S2] Oh my god! Okay.. it's happening. Everybody stay calm!\n[S1] What's the procedure...\n[S2] Everybody stay fricking calm!!!... Everybody fudging calm down!!!!!\n[S1] Yes! Yes! Let's try it out at https://replicate.com/zsxkib/dia (laughs)\n[S2] (whispers) try it now (whispers)",
  "top_p": 0.95,
  "cfg_scale": 4,
  "temperature": 1.3,
  "speed_factor": 0.94,
  "max_new_tokens": 3072,
  "cfg_filter_top_k": 35
}
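
The inputs above could be submitted with the Replicate Python client roughly as follows. This is a minimal sketch, not the page author's code: the helper names (`model_ref`, `example_input`, `run_dia`) are hypothetical, the version ID is the one listed under Version Details, and the call assumes the third-party `replicate` package is installed and a `REPLICATE_API_TOKEN` is set in the environment.

```python
MODEL = "zsxkib/dia"
# Version ID as listed under "Version Details" on this page.
VERSION = "2119e338ca5c0dacd3def83158d6c80d431f2ac1024146d8cca9220b74385599"


def model_ref(model: str = MODEL, version: str = VERSION) -> str:
    """Build the "owner/name:version" reference string used by Replicate."""
    return f"{model}:{version}"


def example_input(text: str) -> dict:
    """Return the non-text settings from the example run, with caller-supplied text."""
    return {
        "text": text,
        "top_p": 0.95,
        "cfg_scale": 4,
        "temperature": 1.3,
        "speed_factor": 0.94,
        "max_new_tokens": 3072,
        "cfg_filter_top_k": 35,
    }


def run_dia(text: str):
    """Send one prediction request (hypothetical helper; needs network access)."""
    import replicate  # third-party: pip install replicate

    return replicate.run(model_ref(), input=example_input(text))


if __name__ == "__main__":
    print(model_ref())
    print(example_input("[S1] Hello there. [S2] (laughs) Hi!"))
```

Depending on the client version, the return value of `replicate.run` may be a URL string or a file-like object, so it is worth inspecting before use.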

Input Parameters

seed Type: integer: Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave blank for random results each time.
text (required) Type: string: Input text for dialogue generation. Use [S1], [S2] to indicate different speakers and (description) in parentheses for non-verbal cues, e.g., (laughs), (whispers).
top_p Type: number, Default: 0.95, Range: 0.1 - 1: Controls diversity of word choice. Higher values include more unusual options. Most users shouldn't need to adjust this parameter.
cfg_scale Type: number, Default: 3, Range: 1 - 5: Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more.
temperature Type: number, Default: 1.8, Range: 1 - 2.5: Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values make output more consistent. Set to 0 for deterministic (greedy) generation.
audio_prompt Type: string: Optional audio file (.wav/.mp3/.flac) for voice cloning. The model will attempt to mimic this voice style.
speed_factor Type: number, Default: 1, Range: 0.5 - 1.5: Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio; 1.0 is original speed.
max_new_tokens Type: integer, Default: 3072, Range: 500 - 4096: Controls the length of generated audio. Higher values create longer audio (86 tokens ≈ 1 second of audio).
cfg_filter_top_k Type: integer, Default: 45, Range: 10 - 100: Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio.
audio_prompt_text Type: string: Optional transcript of the audio prompt. If provided, this will be prepended to the main text input.
max_audio_prompt_seconds Type: integer, Default: 10, Range: 1 - 120: Maximum duration in seconds for the input voice cloning audio prompt. Only used when an audio prompt is provided. Longer voice samples will be truncated to this length.
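
The ranges listed above can be checked on the client side before a request is sent, and the "86 tokens ≈ 1 second" rule of thumb gives a rough upper bound on audio length. The sketch below is illustrative only: `check_inputs` and `max_audio_seconds` are hypothetical helpers, and the checker applies the stated ranges literally, so it would flag a temperature of 0 even though the description mentions that value. Whether the model accepts it is not confirmed here.

```python
# Documented (min, max) ranges, copied from the parameter list.
RANGES = {
    "top_p": (0.1, 1.0),
    "cfg_scale": (1.0, 5.0),
    "temperature": (1.0, 2.5),
    "speed_factor": (0.5, 1.5),
    "max_new_tokens": (500, 4096),
    "cfg_filter_top_k": (10, 100),
    "max_audio_prompt_seconds": (1, 120),
}


def check_inputs(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not params.get("text"):
        problems.append("text is required")
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name}={params[name]} is outside {low} - {high}")
    return problems


def max_audio_seconds(max_new_tokens: int = 3072, tokens_per_second: float = 86.0) -> float:
    """Approximate upper bound on audio length, before any speed adjustment."""
    return max_new_tokens / tokens_per_second


if __name__ == "__main__":
    print(check_inputs({"text": "[S1] Hi.", "cfg_scale": 4, "temperature": 1.3}))
    print(round(max_audio_seconds(3072), 1), "seconds at the default token budget")
```

At the default budget of 3072 tokens this works out to roughly 36 seconds, and the maximum of 4096 tokens to roughly 48 seconds.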

Output Schema

Output

Type: string • Format: uri
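
Because the output is a URI string, saving the audio locally is a plain download. The following is a minimal sketch using only the standard library; `save_output` is a hypothetical helper, and the destination filename is an arbitrary choice.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output(uri: str, dest: str) -> Path:
    """Stream the resource at `uri` into the file `dest` and return its path."""
    dest_path = Path(dest)
    with urllib.request.urlopen(uri) as response, dest_path.open("wb") as handle:
        shutil.copyfileobj(response, handle)
    return dest_path
```

A call such as `save_output(output_uri, "dialogue.wav")` would then write the generated audio next to the script, where `output_uri` stands for whatever URI the prediction returned.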

Example Execution Logs

Random seed set to 37408
Generating audio tokens...
Warning: Clamping 1 indices outside range [0, 1023] to 0.
Token generation finished in 23.99 seconds.
Generated audio shape: (740352,)
Adjusting speed by factor 0.94...
Resampled audio from 740352 to 787608 samples.
Saving audio to /tmp/tmpgvezeg9d/output.wav...
Audio saved in 0.02 seconds.
Total prediction time: 24.01 seconds.
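
The sample counts in the log are consistent with dividing the generated length by the speed factor and truncating to an integer. The sketch below reproduces that arithmetic; it is an inference from the logged numbers, not the model's actual resampling code, and the 44.1 kHz sample rate used for the duration estimate is an assumption that this page does not state.

```python
ASSUMED_SAMPLE_RATE = 44_100  # assumption: not stated on this page


def resampled_length(n_samples: int, speed_factor: float) -> int:
    """Length after a speed change, assuming truncating division by the factor."""
    return int(n_samples / speed_factor)


def duration_seconds(n_samples: int, sample_rate: int = ASSUMED_SAMPLE_RATE) -> float:
    """Convert a sample count to seconds at the assumed sample rate."""
    return n_samples / sample_rate


if __name__ == "__main__":
    before = 740_352  # "Generated audio shape" from the log
    after = resampled_length(before, 0.94)
    print(after)  # matches the logged 787608
    print(round(duration_seconds(before), 2), "->", round(duration_seconds(after), 2), "seconds")
```

A speed factor below 1.0 therefore lengthens the clip, which matches the parameter description: slower playback means more samples at the same rate.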

Version Details

Version ID: 2119e338ca5c0dacd3def83158d6c80d431f2ac1024146d8cca9220b74385599
Version Created: July 15, 2025
