playht/play-dialog
About
End-to-end AI speech model designed for natural-sounding conversational speech synthesis, with support for context-aware prosody, intonation, and emotional expression.

Example Output
Output
Performance Metrics
- Prediction Time: 26.46s
- Total Time: 26.47s
All Input Parameters
{
  "text": "Close your eyes gently. Take a deep breath in through your nose, allowing your lungs to fill completely. Hold it for a moment, then exhale slowly and deeply through your mouth. Let any tension you're holding melt away with each exhale. Visualize yourself standing at the edge of a beautiful, tranquil forest. The first rays of morning light stream through the trees, illuminating the delicate dewdrops on leaves and petals. Birds are beginning to sing their morning songs, and the world feels fresh and alive.",
  "voice": "Nia (Young female US conversational voice)",
  "prompt": "",
  "prompt2": "",
  "voice_2": "None",
  "language": "english",
  "turnPrefix": "Voice 1:",
  "temperature": 1.02,
  "turnPrefix2": "Voice 2:",
  "voice_conditioning_seconds": 20,
  "voice_conditioning_seconds_2": 20
}
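As a rough illustration, the example input above could be assembled in Python along the following lines. This is a minimal sketch, not the service's client code: `EXAMPLE_DEFAULTS` and `build_input` are hypothetical names introduced here, the default values are simply copied from the example, and `seed` and `speed` are accepted only because they appear in the parameter list on this page.

```python
import json

# Defaults copied from the example input on this page (not official defaults).
EXAMPLE_DEFAULTS = {
    "voice": "Nia (Young female US conversational voice)",
    "prompt": "",
    "prompt2": "",
    "voice_2": "None",
    "language": "english",
    "turnPrefix": "Voice 1:",
    "temperature": 1.02,
    "turnPrefix2": "Voice 2:",
    "voice_conditioning_seconds": 20,
    "voice_conditioning_seconds_2": 20,
}


def build_input(text, **overrides):
    """Hypothetical helper: example defaults, then overrides, then text."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text is required and must be non-empty")
    # Reject names that are not in the documented parameter list.
    unknown = set(overrides) - set(EXAMPLE_DEFAULTS) - {"seed", "speed"}
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    payload = dict(EXAMPLE_DEFAULTS)
    payload.update(overrides)
    payload["text"] = text
    return payload


payload = build_input("Close your eyes gently.", seed=1958393623)
payload_json = json.dumps(payload, indent=2)
print(payload_json)
```

Rejecting unknown keys up front is a design choice of this sketch; it catches misspellings such as `turnprefix` before a request is sent.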
Input Parameters
- seed: Random seed. Set it for reproducible generation.
- text (required): Text for speech generation.
- speed: Controls how fast the generated audio is spoken.
- voice: Voice to use for generation.
- prompt: A prompt to guide the style of the output generated by the first voice.
- prompt2: A prompt to guide the style of the output generated by the second voice.
- voice_2: Optional second voice to use for generation.
- language: The language of the text to be spoken.
- turnPrefix: The prefix that marks the start of a turn for the first voice in a multi-turn dialogue.
- temperature: Controls variance. Lower temperatures give more predictable results; higher temperatures let each run vary more, so the voice may sound less like the baseline voice.
- turnPrefix2: The prefix that marks the start of a turn for the second voice in a multi-turn dialogue.
- voice_conditioning_seconds: The number of seconds of conditioning to use from the selected voice. Lower values produce audio less similar to the cloned voice but improve model stability and expressiveness. Higher values produce output more similar to the cloned voice but can cause instability and reduce expressiveness.
- voice_conditioning_seconds_2: The number of seconds of conditioning to use from the second selected voice.
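The turn-prefix parameters suggest that a two-speaker script is passed as a single `text` string in which each turn begins with its speaker's prefix. The sketch below shows one way such a string might be built. `format_dialogue` is a hypothetical helper, joining turns with a single space is an assumption (the page does not specify a separator), and the `voice_2` value is a placeholder rather than a real voice name.

```python
def format_dialogue(turns, prefix_1="Voice 1:", prefix_2="Voice 2:"):
    """Join (speaker, line) pairs into one script string.

    speaker is 1 or 2. The single-space separator between turns is an
    assumption; the source does not say how turns are delimited.
    """
    prefixes = {1: prefix_1, 2: prefix_2}
    parts = []
    for speaker, line in turns:
        if speaker not in prefixes:
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        parts.append(f"{prefixes[speaker]} {line.strip()}")
    return " ".join(parts)


turns = [
    (1, "Did you sleep well?"),
    (2, "I did, thanks for asking."),
    (1, "Ready for the morning walk?"),
]

dialogue_input = {
    "text": format_dialogue(turns),
    "voice": "Nia (Young female US conversational voice)",
    "voice_2": "<second voice name>",  # placeholder, not a real voice name
    "turnPrefix": "Voice 1:",    # must match the prefixes used in the text
    "turnPrefix2": "Voice 2:",
}
print(dialogue_input["text"])
```

Keeping the prefixes in one place and passing the same strings as `turnPrefix` and `turnPrefix2` avoids a mismatch between the script and the parameters.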
Output Schema
Output
Example Execution Logs
Using seed: 1958393623 Running prediction... Generating audio... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Still processing... Generated audio in 26.1sec Length of generated audio: 39.24 seconds Downloading 785805 bytes Downloaded 0.75MB in 0.28sec
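The figures in this log allow a few quick derived numbers, such as how generation time compares with audio length. The following is an illustrative sketch only: the regular expressions assume the exact log wording shown above, and the variable names are introduced here for the example.

```python
import re

# The four summary entries from the example log above.
LOG = (
    "Generated audio in 26.1sec "
    "Length of generated audio: 39.24 seconds "
    "Downloading 785805 bytes "
    "Downloaded 0.75MB in 0.28sec"
)

gen_seconds = float(re.search(r"Generated audio in ([\d.]+)sec", LOG).group(1))
audio_seconds = float(
    re.search(r"Length of generated audio: ([\d.]+) seconds", LOG).group(1)
)
size_bytes = int(re.search(r"Downloading (\d+) bytes", LOG).group(1))

# Real-time factor: seconds of compute per second of audio produced.
real_time_factor = gen_seconds / audio_seconds
# Average bitrate of the downloaded file, in kilobits per second.
bitrate_kbps = size_bytes * 8 / audio_seconds / 1000
# The "0.75MB" in the log is consistent with bytes / 1024**2.
size_mib = size_bytes / 1024**2

print(f"real-time factor: {real_time_factor:.2f}")
print(f"average bitrate:  {bitrate_kbps:.0f} kbps")
print(f"file size:        {size_mib:.2f} MiB")
```

A real-time factor below 1 means the audio was generated faster than it takes to play back; for this run it is about 0.67.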
Version Details
- Version ID: 0d5710136b2204bb0a8b927a9e50904af22c2d238b813b7e0cdf8f17f12670f8
- Version Created: January 13, 2025