zsxkib/dia 🔢📝🖼️🔊 → 🖼️

▶️ 9.2K runs 📅 Apr 2025 ⚙️ Cog 0.15.10 🔗 GitHub ⚖️ License
dialogue multi-speaker-tts text-to-speech voice-cloning

About

Dia 1.6B by Nari Labs. Generates realistic dialogue audio from text, including non-verbal cues and voice cloning.

Example Output

Performance Metrics

24.03s Prediction Time
90.39s Total Time
All Input Parameters
{
  "text": "[S1] It's on Replicate!!! Oh fire! Oh my goodness! What's the procedure? What to we do people? The Dia text-to-speech model just dropped on Replicate!!\n[S2] Oh my god! Okay.. it's happening. Everybody stay calm!\n[S1] What's the procedure...\n[S2] Everybody stay fricking calm!!!... Everybody fudging calm down!!!!!\n[S1] Yes! Yes! Let's try it out at https://replicate.com/zsxkib/dia (laughs)\n[S2] (whispers) try it now (whispers)",
  "top_p": 0.95,
  "cfg_scale": 4,
  "temperature": 1.3,
  "speed_factor": 0.94,
  "max_new_tokens": 3072,
  "cfg_filter_top_k": 35
}
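The `text` value above follows the script convention described under Input Parameters: `[S1]`, `[S2]` tags mark speaker turns and parenthesised words such as `(laughs)` mark non-verbal cues. As a rough sketch of how such a script could be checked before submission, the snippet below splits a script into turns. The function name `parse_script` and the regular expressions are illustrative assumptions, not part of the model or its API.

```python
import re

# Assumed patterns: speaker tags like [S1], [S2]; cues like (laughs), (whispers).
SPEAKER_TAG = re.compile(r"\[(S\d+)\]")
CUE = re.compile(r"\(([^()]+)\)")


def parse_script(text):
    """Split a tagged script into (speaker, line, cues) tuples (hypothetical helper)."""
    # re.split with a capturing group yields [preamble, tag, body, tag, body, ...].
    parts = SPEAKER_TAG.split(text)
    turns = []
    for speaker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        turns.append((speaker, body, CUE.findall(body)))
    return turns


script = "[S1] Let's try it out (laughs)\n[S2] (whispers) try it now (whispers)"
for speaker, line, cues in parse_script(script):
    print(speaker, cues, line)
```

Note that this sketch treats every parenthetical as a cue, so ordinary parenthetical asides in the dialogue would be reported as cues too.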
Input Parameters
seed Type: integer
Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave blank for random results each time.
text (required) Type: string
Input text for dialogue generation. Use [S1], [S2] to indicate different speakers, and put descriptions in parentheses for non-verbal cues, e.g., (laughs), (whispers).
top_p Type: number Default: 0.95 Range: 0.1 - 1
Controls diversity of word choice. Higher values include more unusual options. Most users shouldn't need to adjust this parameter.
cfg_scale Type: number Default: 3 Range: 1 - 5
Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more.
temperature Type: number Default: 1.8 Range: 1 - 2.5
Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values make output more consistent. Set to 0 for deterministic (greedy) generation, although 0 lies outside the documented range of 1 - 2.5.
audio_prompt Type: string
Optional audio file (.wav/.mp3/.flac) for voice cloning. The model will attempt to mimic this voice style.
speed_factor Type: number Default: 1 Range: 0.5 - 1.5
Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio, values above 1.0 speed it up, and 1.0 keeps the original speed.
max_new_tokens Type: integer Default: 3072 Range: 500 - 4096
Controls the maximum length of generated audio. Higher values allow longer audio (86 tokens ≈ 1 second of audio).
cfg_filter_top_k Type: integer Default: 45 Range: 10 - 100
Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio.
audio_prompt_text Type: string
Optional transcript of the audio prompt. If provided, this will be prepended to the main text input.
max_audio_prompt_seconds Type: integer Default: 10 Range: 1 - 120
Maximum duration in seconds for the input voice cloning audio prompt. Only used when an audio prompt is provided. Longer voice samples will be truncated to this length.
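The documented ranges above can be checked locally before a prediction is submitted. The following is a minimal sketch under stated assumptions: the helper names `build_input` and `max_audio_seconds` are hypothetical, the ranges are copied from the parameter list, and the duration estimate relies only on the stated rule of thumb of 86 tokens per second.

```python
# Documented ranges copied from the parameter list: name -> (min, max).
RANGES = {
    "top_p": (0.1, 1),
    "cfg_scale": (1, 5),
    "temperature": (1, 2.5),
    "speed_factor": (0.5, 1.5),
    "max_new_tokens": (500, 4096),
    "cfg_filter_top_k": (10, 100),
    "max_audio_prompt_seconds": (1, 120),
}
TOKENS_PER_SECOND = 86  # from "86 tokens ≈ 1 second of audio"


def build_input(text, **params):
    """Return an input dict, rejecting values outside the documented ranges."""
    if not text.strip():
        raise ValueError("text is required")
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} is outside {lo} - {hi}")
    # Parameters without a documented range (e.g. seed) pass through unchanged.
    return {"text": text, **params}


def max_audio_seconds(max_new_tokens):
    """Approximate upper bound on audio length before any speed adjustment."""
    return max_new_tokens / TOKENS_PER_SECOND


payload = build_input(
    "[S1] Hello there. [S2] Hi! (laughs)",
    cfg_scale=4,
    temperature=1.3,
    speed_factor=0.94,
    max_new_tokens=3072,
)
print(round(max_audio_seconds(payload["max_new_tokens"]), 1))  # 35.7
```

With the official `replicate` Python client, a payload like this would typically be passed as `replicate.run("zsxkib/dia:<version id>", input=payload)`, which requires a `REPLICATE_API_TOKEN` in the environment. That call is left out of the sketch so the block runs offline.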
Output Schema

Output

Type: string Format: uri

Example Execution Logs
Random seed set to 37408
Generating audio tokens...
Warning: Clamping 1 indices outside range [0, 1023] to 0.
Token generation finished in 23.99 seconds.
Generated audio shape: (740352,)
Adjusting speed by factor 0.94...
Resampled audio from 740352 to 787608 samples.
Saving audio to /tmp/tmpgvezeg9d/output.wav...
Audio saved in 0.02 seconds.
Total prediction time: 24.01 seconds.
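The sample counts in the log are consistent with dividing the original length by the speed factor and truncating: 740352 / 0.94 ≈ 787608.5, which truncates to 787608. The sketch below reproduces that arithmetic and pairs it with a simple linear-interpolation resampler. The model's actual resampling code is not shown on this page, so the interpolation method and both function names are assumptions made for illustration.

```python
def resampled_length(n_samples, speed_factor):
    """Output length after a speed change; truncation matches the logged numbers."""
    return int(n_samples / speed_factor)


def resample_linear(samples, speed_factor):
    """Stretch or compress a list of samples by linear interpolation (assumed method)."""
    n_in = len(samples)
    n_out = resampled_length(n_in, speed_factor)
    if n_in < 2 or n_out < 2:
        return list(samples[:n_out])
    # Map output index i onto a fractional position in the input.
    step = (n_in - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), n_in - 1)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


print(resampled_length(740352, 0.94))  # 787608, as in the log above
print(len(resample_linear([0.0, 1.0, 2.0, 3.0], 0.5)))  # 8
```

A speed factor below 1.0 yields more samples at the same sample rate, which is why the slowed-down audio in the log is longer than the generated audio.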
Version Details
Version ID
2119e338ca5c0dacd3def83158d6c80d431f2ac1024146d8cca9220b74385599
Version Created
July 15, 2025