dessix/moss-ttsd
About
MOSS-TTSD (text to spoken dialogue) is an open-source bilingual spoken dialogue synthesis model that supports both Chinese and English. It converts two-speaker dialogue scripts into natural, expressive conversational speech.

Example Output
Output
Performance Metrics
- Prediction Time: 22.04s
- Total Time: 22.06s
All Input Parameters
{
  "seed": 42,
  "text": "[S1]诶,跟你说个事儿啊,我最近听了不少那种AI生成的播客,不知道你有没有听过。[S2]哦,听过一些。怎么了,感觉怎么样?[S1]就是……怎么说呢,单听一句话,你觉得,哇,好像跟真人没啥区别。[S2]嗯。[S1]但是,你只要让它说上一段完整的对话,比如俩人聊天那种,那个感觉就立马不对了。[S2]对对对,我懂你的意思。就是那个所谓的“恐怖谷”效应,是吧?听着有点瘆人,感觉特别假,没有那个交流感。[S1]就是这个词儿,恐怖谷。结果你猜怎么着,这个事儿最近好像有救了。",
  "use_normalize": true,
  "reference_text_speaker1": "周一到周五每天早晨七点半到九点半的直播片段,言下之意呢就是废话有点多,大家也别嫌弃,因为这都是直播间最真实的状态了",
  "reference_text_speaker2": "如果大家想听到更丰富更及时的直播内容,记得在周一到周五准时进入直播间,和大家一起畅聊新消费新科技新趋势",
  "reference_audio_speaker1": "https://replicate.delivery/pbxt/NQ5EtOr3vU5lQdCgXwDM5ywZRwEez5LYv7Mx3XOHHe8vWqQB/zh_spk1_moon.wav",
  "reference_audio_speaker2": "https://replicate.delivery/pbxt/NQ5Eu4HdpF1khhnx36ivEFa54pxlpBHg7Zx8RLrzIxlEutKx/zh_spk2_moon.wav",
  "reference_audio_speaker1_base64": "",
  "reference_audio_speaker2_base64": ""
}
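A payload like the one above could be sent with the Replicate Python client. The following is a minimal sketch, not the model author's code. It assumes the third-party `replicate` package is installed and that `REPLICATE_API_TOKEN` is set in the environment. The helper names `model_ref` and `run_prediction` are hypothetical; the model name and version ID are the ones listed on this page. The short example payload omits the reference fields, which this page describes as optional for audio; whether the reference text fields can also be omitted is an assumption.

```python
# Model name and version ID as listed on this page.
MODEL = "dessix/moss-ttsd"
VERSION = "45485ef4ee8ad08ee718ed2ae59d4c8f1cb682ad7842f1a31092c6fbe7c575f5"


def model_ref(model=MODEL, version=VERSION):
    """Build the 'owner/name:version' reference string used by replicate.run."""
    return f"{model}:{version}"


def run_prediction(inputs):
    """Send one prediction request (hypothetical helper).

    The import is done lazily so the rest of this sketch works without
    the third-party client installed. Needs REPLICATE_API_TOKEN to be set.
    """
    import replicate

    # The return value is passed through unchanged; its exact type
    # depends on the client version.
    return replicate.run(model_ref(), input=inputs)


# A short payload using only parameter names shown on this page.
example_inputs = {
    "seed": 42,
    "text": "[S1]Hello there.[S2]Hi, how are you?",
    "use_normalize": True,
}

if __name__ == "__main__":
    # Print the reference only; call run_prediction(example_inputs)
    # yourself once a token is configured.
    print(model_ref())
```

Pinning the version ID keeps results tied to the build documented here; dropping it would be a different design choice that follows whatever version is current.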
Input Parameters
- seed: Random seed for reproducibility
- text: Dialogue text, format: [S1]Speaker 1 content[S2]Speaker 2 content[S1]...
- use_normalize: Whether to use text normalization (recommended for better handling of numbers, punctuation, etc.)
- reference_text_speaker1: Reference text for speaker 1 (corresponding to reference audio)
- reference_text_speaker2: Reference text for speaker 2 (corresponding to reference audio)
- reference_audio_speaker1: Reference audio file for speaker 1 (optional, for voice cloning)
- reference_audio_speaker2: Reference audio file for speaker 2 (optional, for voice cloning)
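The `text` format described above can be checked locally before a request is made. The sketch below is one way to do that, under stated assumptions: it treats `[S1]` and `[S2]` as the only speaker tags, requires the text to begin with a tag, and rejects empty turns. The model's own parser may be stricter or more lenient. `parse_dialogue` and `build_input` are hypothetical helper names; the dictionary keys are the parameter names listed above.

```python
import re

# Matches the speaker tags [S1] and [S2]; the group captures "S1" or "S2".
TAG = re.compile(r"\[(S[12])\]")


def parse_dialogue(text):
    """Split '[S1]...[S2]...' into a list of (speaker, utterance) pairs."""
    # With one capture group, split() yields [prefix, tag, body, tag, body, ...].
    parts = TAG.split(text)
    if parts[0].strip():
        raise ValueError("text must start with a [S1] or [S2] tag")
    turns = []
    for speaker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if not body:
            raise ValueError(f"empty turn for {speaker}")
        turns.append((speaker, body))
    if not turns:
        raise ValueError("no speaker tags found")
    return turns


def build_input(text, seed=42, use_normalize=True, references=None):
    """Assemble an input dict after validating the dialogue text.

    references maps a speaker number (1 or 2) to an (audio, transcript)
    pair, e.g. {1: ("https://example.invalid/a.wav", "transcript")}.
    """
    parse_dialogue(text)  # raises ValueError on malformed text
    payload = {"seed": seed, "text": text, "use_normalize": use_normalize}
    for number, (audio, transcript) in (references or {}).items():
        if number not in (1, 2):
            raise ValueError("speaker number must be 1 or 2")
        payload[f"reference_audio_speaker{number}"] = audio
        payload[f"reference_text_speaker{number}"] = transcript
    return payload


turns = parse_dialogue("[S1]Hi there.[S2]Hello![S1]Bye.")
# turns == [("S1", "Hi there."), ("S2", "Hello!"), ("S1", "Bye.")]
```

Keeping each reference audio paired with its transcript in one tuple makes it harder to supply one without the other, which matches the "corresponding to reference audio" wording above.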
Output Schema
Output
Example Execution Logs
Processing text: [S1]诶,跟你说个事儿啊,我最近听了不少那种AI生成的播客,不知道你有没有听过。[S2]哦,听过一些。怎么了,感觉怎么样?[S1]就是……怎么说呢,单听一句话,你觉得,哇,好像跟真人没啥区别。[S2...
Using voice cloning with reference audio
Processing 1 samples starting from index 0...
Using speaker1 and speaker2 information for prompt audio and text.
Starting batch audio generation...
Original outputs shape: torch.Size([1, 823, 8])
Start value: 467
Shape after slicing: torch.Size([1, 356, 8])
MAX_CHANNELS: 8
Calculated seq_len: 816
Speech token shape for sample 0: torch.Size([349, 8])
Audio generation completed: sample 0
Successfully generated audio with duration: 27.92 seconds
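The numbers in this log fit together with simple arithmetic, which the sketch below reproduces. It is a reading of the logged values only, not a description of the model's code: the interpretation that `MAX_CHANNELS - 1` frames are trimmed (as would happen if each channel were offset by one step) is an assumption, and the variable names are hypothetical. The prediction time comes from the Performance Metrics on this page.

```python
# Values copied from the log and the metrics on this page.
total_frames = 823        # Original outputs shape: [1, 823, 8]
start_value = 467         # Start value
max_channels = 8          # MAX_CHANNELS
audio_seconds = 27.92     # generated audio duration
prediction_seconds = 22.04

# Frames left after dropping everything before the start value.
sliced_frames = total_frames - start_value             # 356, as logged

# Assumption: the remaining differences are both MAX_CHANNELS - 1.
seq_len = total_frames - (max_channels - 1)            # 816, as logged
speech_frames = sliced_frames - (max_channels - 1)     # 349, as logged

# Implied token frame rate and real-time factor for this single run.
frames_per_second = speech_frames / audio_seconds      # about 12.5
real_time_factor = prediction_seconds / audio_seconds  # about 0.79

print(sliced_frames, seq_len, speech_frames)
print(round(frames_per_second, 2), round(real_time_factor, 3))
```

A real-time factor below 1 means this run produced audio faster than its playback length; a single example is not a benchmark.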
Version Details
- Version ID: 45485ef4ee8ad08ee718ed2ae59d4c8f1cb682ad7842f1a31092c6fbe7c575f5
- Version Created: August 13, 2025