zsxkib/dia 🔢📝🖼️🔊 → 🖼️
About
Dia 1.6B by Nari Labs generates realistic dialogue audio from text, including non-verbal cues and voice cloning.

Example Output
Output
Performance Metrics
- Prediction Time: 24.03s
- Total Time: 90.39s
All Input Parameters
{
  "text": "[S1] It's on Replicate!!! Oh fire! Oh my goodness! What's the procedure? What to we do people? The Dia text-to-speech model just dropped on Replicate!!\n[S2] Oh my god! Okay.. it's happening. Everybody stay calm!\n[S1] What's the procedure...\n[S2] Everybody stay fricking calm!!!... Everybody fudging calm down!!!!!\n[S1] Yes! Yes! Let's try it out at https://replicate.com/zsxkib/dia (laughs)\n[S2] (whispers) try it now (whispers)",
  "top_p": 0.95,
  "cfg_scale": 4,
  "temperature": 1.3,
  "speed_factor": 0.94,
  "max_new_tokens": 3072,
  "cfg_filter_top_k": 35
}
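The parameters above can be sent to the model from code. The following is a minimal sketch, assuming the official `replicate` Python client is installed and a `REPLICATE_API_TOKEN` environment variable is set. The names `EXAMPLE_INPUT`, `MODEL_REF`, and `run_dia` are illustrative only, the text is shortened, and the shape of the returned value depends on the client version.

```python
import json

# Values copied from the example input above; the text is shortened for brevity.
EXAMPLE_INPUT = {
    "text": (
        "[S1] The Dia text-to-speech model just dropped on Replicate!!\n"
        "[S2] (whispers) try it now (whispers)"
    ),
    "top_p": 0.95,
    "cfg_scale": 4,
    "temperature": 1.3,
    "speed_factor": 0.94,
    "max_new_tokens": 3072,
    "cfg_filter_top_k": 35,
}

# Model name plus the version ID listed under Version Details.
MODEL_REF = (
    "zsxkib/dia:"
    "2119e338ca5c0dacd3def83158d6c80d431f2ac1024146d8cca9220b74385599"
)


def run_dia(model_input):
    """Send one prediction request (assumes `pip install replicate`)."""
    import replicate  # imported lazily so the rest of this sketch works offline

    return replicate.run(MODEL_REF, input=model_input)


if __name__ == "__main__":
    # Print the payload; call run_dia(EXAMPLE_INPUT) to actually generate audio.
    print(json.dumps(EXAMPLE_INPUT, indent=2))
```

Pinning the version ID in the model reference keeps results tied to the version documented here rather than to whatever version is latest.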
Input Parameters
- seed: Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave blank for random results each time.
- text (required): Input text for dialogue generation. Use [S1], [S2] to indicate different speakers and (description) in parentheses for non-verbal cues e.g., (laughs), (whispers).
- top_p: Controls diversity of word choice. Higher values include more unusual options. Most users shouldn't need to adjust this parameter.
- cfg_scale: Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more.
- temperature: Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values make output more consistent. Set to 0 for deterministic (greedy) generation.
- audio_prompt: Optional audio file (.wav/.mp3/.flac) for voice cloning. The model will attempt to mimic this voice style.
- speed_factor: Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio; 1.0 is original speed.
- max_new_tokens: Controls the length of generated audio. Higher values create longer audio. (86 tokens ≈ 1 second of audio).
- cfg_filter_top_k: Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio.
- audio_prompt_text: Optional transcript of the audio prompt. If provided, this will be prepended to the main text input.
- max_audio_prompt_seconds: Maximum duration in seconds for the input voice cloning audio prompt. Only used when an audio prompt is provided. Longer voice samples will be truncated to this length.
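Two of the parameters above lend themselves to small helpers: the speaker-tagged `text` format and the `max_new_tokens` budget. The sketch below is illustrative only; `build_script`, `tokens_for_seconds`, and `seconds_for_tokens` are hypothetical names, the restriction to speakers 1 and 2 is an assumption based on the tags shown in the description, and the 86 tokens per second figure is the approximation quoted there.

```python
import math

# Approximate rate quoted in the max_new_tokens description.
TOKENS_PER_SECOND = 86


def build_script(turns):
    """Join (speaker_number, line) pairs into [S1]/[S2]-tagged text.

    Non-verbal cues such as (laughs) are written inline in each line.
    """
    tagged = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            # Assumption: only the [S1] and [S2] tags are handled here.
            raise ValueError("this sketch only handles speakers 1 and 2")
        tagged.append(f"[S{speaker}] {line.strip()}")
    return "\n".join(tagged)


def tokens_for_seconds(seconds):
    """Rough max_new_tokens value for a target audio length."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def seconds_for_tokens(tokens):
    """Rough audio length, in seconds, for a given token budget."""
    return tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    script = build_script([
        (1, "Let's try it out (laughs)"),
        (2, "(whispers) try it now (whispers)"),
    ])
    print(script)
    print(tokens_for_seconds(10))
    print(round(seconds_for_tokens(3072), 1))
```

By this estimate, the `max_new_tokens` value of 3072 used in the example corresponds to an upper bound of roughly 36 seconds of audio before any speed adjustment.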
Output Schema
Output
Example Execution Logs
Random seed set to 37408
Generating audio tokens...
Warning: Clamping 1 indices outside range [0, 1023] to 0.
Token generation finished in 23.99 seconds.
Generated audio shape: (740352,)
Adjusting speed by factor 0.94...
Resampled audio from 740352 to 787608 samples.
Saving audio to /tmp/tmpgvezeg9d/output.wav...
Audio saved in 0.02 seconds.
Total prediction time: 24.01 seconds.
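The resampling step in the logs is consistent with dividing the sample count by `speed_factor`. The sketch below reproduces that arithmetic and shows one simple way to stretch a signal by linear interpolation. It is an assumption-laden illustration, not the model's actual code: the function names are hypothetical, and truncating (rather than rounding) the new length is a guess that happens to match the logged numbers.

```python
def resampled_length(num_samples, speed_factor):
    """New sample count after a speed change (assumed: truncate, not round)."""
    return int(num_samples / speed_factor)


def change_speed(samples, speed_factor):
    """Stretch or compress a list of floats by linear interpolation."""
    n = len(samples)
    target = resampled_length(n, speed_factor)
    if n <= 1 or target <= 1:
        return list(samples)
    step = (n - 1) / (target - 1)  # spacing of output points on the input axis
    out = []
    for i in range(target):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Weighted average of the two nearest input samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    # Figures taken from the log above: 740352 samples at factor 0.94.
    print(resampled_length(740352, 0.94))
```

A factor below 1.0 yields more samples than the input, which plays back as slower audio at the same sample rate.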
Version Details
- Version ID: 2119e338ca5c0dacd3def83158d6c80d431f2ac1024146d8cca9220b74385599
- Version Created: July 15, 2025