prunaai/dia-1.6b

▶️ 1.8K runs 📅 Apr 2025 ⚙️ Cog 0.14.1
multi-speaker multi-speaker-tts text-to-speech

Example Output

Performance Metrics

20.49s Prediction Time
20.50s Total Time
All Input Parameters
{
  "seed": -1,
  "text": "[S1] It's on Replicate!!! Oh fire! Oh my goodness! What's the procedure? What do we do people? The Dia text-to-speech model — now Pruna-optimized — just dropped on Replicate!!\n\n[S2] Oh my god! Okay… it's happening. Everybody stay calm!\n\n[S1] What's the procedure…\n\n[S2] Everybody stay fricking calm!!!... Everybody fudging calm down!!!!!\n\n[S1] Yes! Yes! Let's try it out at prunaai/dia-1.6b (laughs) — powered up and made leaner with Pruna!\n\n[S2] (whispers) try it now… (whispers) turbocharged by Pruna…",
  "top_p": 0.95,
  "cfg_scale": 3,
  "temperature": 1.3,
  "speed_factor": 0.94,
  "max_new_tokens": 3072,
  "cfg_filter_top_k": 35
}
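The example input above could be submitted from Python roughly as follows. This is a minimal sketch, not the page's own code: it assumes the official `replicate` client package is installed and a `REPLICATE_API_TOKEN` environment variable is set, and the helper names `build_input` and `run_prediction` are hypothetical. The defaults and the version hash are copied from this page.

```python
# Defaults copied from the example input on this page.
DEFAULT_INPUT = {
    "seed": -1,
    "top_p": 0.95,
    "cfg_scale": 3,
    "temperature": 1.3,
    "speed_factor": 0.94,
    "max_new_tokens": 3072,
    "cfg_filter_top_k": 35,
}

# Model name plus the version ID listed under "Version Details".
MODEL_REF = (
    "prunaai/dia-1.6b:"
    "5e364f84db4cd1916990138229e14179036a1bfcf20a39c4b0f3214e33a5c48a"
)


def build_input(text, **overrides):
    """Merge the default parameters with the script text and any overrides."""
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULT_INPUT, **overrides, "text": text}


def run_prediction(text, **overrides):
    """Call the model (assumption: `replicate` is installed and a token is set)."""
    import replicate  # imported lazily so build_input works without the package

    return replicate.run(MODEL_REF, input=build_input(text, **overrides))
```

Calling `run_prediction("[S1] Hello there. [S2] Hi! (laughs)")` should return a reference to the generated audio. Depending on the client version, that may be a URI string or a file-like object, so check before saving it.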
Input Parameters
seed Type: integer, Default: -1
Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave at the default (-1) for random results each time.
text (required) Type: string
Input text for dialogue generation. Use [S1], [S2] to indicate different speakers and (description) in parentheses for non-verbal cues e.g., (laughs), (whispers).
top_p Type: number, Default: 0.95, Range: 0.1 - 1
Controls diversity of word choice. Higher values include more unusual options. Most users shouldn't need to adjust this parameter.
cfg_scale Type: number, Default: 3, Range: 1 - 5
Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more.
temperature Type: number, Default: 1.3, Range: 0.1 - 2
Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values (0.1-0.9) make output more consistent and predictable.
speed_factor Type: number, Default: 0.94, Range: 0.5 - 1.5
Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio, values above 1.0 speed it up, and 1.0 keeps the original speed.
max_new_tokens Type: integer, Default: 3072, Range: 500 - 4096
Controls the length of generated audio. Higher values create longer audio (86 tokens ≈ 1 second of audio).
cfg_filter_top_k Type: integer, Default: 35, Range: 10 - 100
Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio.
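The documented ranges and the "86 tokens ≈ 1 second" rule of thumb can be checked locally before submitting a prediction. The sketch below is illustrative only: `check_ranges` and `max_audio_seconds` are hypothetical helpers, and the bounds are copied from the parameter list above rather than read from the model's schema.

```python
# (low, high) bounds copied from the parameter list above.
RANGES = {
    "top_p": (0.1, 1.0),
    "cfg_scale": (1.0, 5.0),
    "temperature": (0.1, 2.0),
    "speed_factor": (0.5, 1.5),
    "max_new_tokens": (500, 4096),
    "cfg_filter_top_k": (10, 100),
}

# Approximate figure taken from the max_new_tokens description.
TOKENS_PER_SECOND = 86


def check_ranges(params):
    """Return a list of readable range violations (empty if all values fit)."""
    problems = []
    for name, value in params.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


def max_audio_seconds(max_new_tokens):
    """Rough upper bound on audio length, before any speed adjustment."""
    return max_new_tokens / TOKENS_PER_SECOND
```

With the default of 3072 tokens, `max_audio_seconds(3072)` comes to roughly 35.7 seconds, so longer scripts may be cut off unless `max_new_tokens` is raised toward its 4096 limit.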
Output Schema

Output

Type: string, Format: uri

Example Execution Logs
Random seed set to 16609
Generating audio tokens...
Token generation finished in 20.46 seconds.
Generated audio shape: (814080,)
Adjusting speed by factor 0.94...
Resampled audio from 814080 to 866042 samples.
Saving audio to /tmp/tmpnfzzw8pj/output.wav...
Audio saved in 0.02 seconds.
Total prediction time: 20.48 seconds.
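The sample counts in the log are consistent with dividing the original length by the speed factor: 814080 / 0.94 truncates to 866042. The sketch below reproduces that arithmetic and shows one way the resampling could work. Linear interpolation is an assumption here, since the page does not say which method the model uses, and both function names are hypothetical.

```python
def adjusted_length(num_samples, speed_factor):
    """Sample count after speed adjustment (more samples when slower)."""
    return int(num_samples / speed_factor)


def adjust_speed(samples, speed_factor):
    """Resample a mono signal by linear interpolation (assumed method)."""
    target = adjusted_length(len(samples), speed_factor)
    if target <= 1 or len(samples) < 2:
        return list(samples)
    # Spread `target` positions evenly across the original index range.
    step = (len(samples) - 1) / (target - 1)
    out = []
    for i in range(target):
        pos = i * step
        left = min(int(pos), len(samples) - 2)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out
```

A factor below 1.0 stretches the signal, which is why the logged output has more samples than the generated audio.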
Version Details
Version ID
5e364f84db4cd1916990138229e14179036a1bfcf20a39c4b0f3214e33a5c48a
Version Created
April 24, 2025