zsxkib/kimi-audio-7b-instruct
About
Kimi-Audio-7B-Instruct combines ASR, audio reasoning, captioning, emotion sensing, and TTS into one universal model.

Example Output
{"json_str":"I'm just an AI, I don't have feelings or experiences, so I don't have good or bad days. But I'm always here to help you with your questions and tasks!","media_path":"https://replicate.delivery/xezq/y6NyEvrmX47HCN7y4HtEAvxNwdcHI7Pk2vwv3l28O0Bg8EKF/output.wav"}
Performance Metrics
- Prediction Time: 5.50s
- Total Time: 5.50s
All Input Parameters
{ "audio": "https://replicate.delivery/pbxt/MvvLnD9djueaX3u8fCEm2bptKjp6j1DXIRB1MFncD9ZGxCvF/replicate-prediction-r6ns1e52whrga0cphha96k3axm.wav", "text_top_k": 5, "audio_top_k": 10, "output_type": "both", "return_json": true, "text_temperature": 0, "audio_temperature": 0.8, "text_repetition_penalty": 1, "audio_repetition_penalty": 1, "text_repetition_window_size": 16, "audio_repetition_window_size": 64 }
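The parameters above could be assembled and submitted from Python roughly as follows. This is a minimal sketch, not the model author's code: `build_input` and `run_prediction` are hypothetical helper names, the settings are copied from the example prediction rather than confirmed as schema defaults, and the call assumes the third-party `replicate` package is installed with `REPLICATE_API_TOKEN` set in the environment.

```python
# Output types named in the parameter documentation.
ALLOWED_OUTPUT_TYPES = {"audio", "text", "both"}

# Values copied from the example prediction; NOT confirmed to be schema defaults.
EXAMPLE_SETTINGS = {
    "text_top_k": 5,
    "audio_top_k": 10,
    "output_type": "both",
    "return_json": True,
    "text_temperature": 0,
    "audio_temperature": 0.8,
    "text_repetition_penalty": 1,
    "audio_repetition_penalty": 1,
    "text_repetition_window_size": 16,
    "audio_repetition_window_size": 64,
}

# Model name plus the version ID listed under "Version Details".
MODEL_REF = (
    "zsxkib/kimi-audio-7b-instruct:"
    "7500b32387695e89da3d09271850319ba027969f0c714dfc226361609ff29f2b"
)


def build_input(audio, prompt=None, **overrides):
    """Hypothetical helper: merge overrides into the example settings."""
    unknown = set(overrides) - set(EXAMPLE_SETTINGS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"audio": audio, **EXAMPLE_SETTINGS, **overrides}
    if prompt is not None:
        payload["prompt"] = prompt
    if payload["output_type"] not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {sorted(ALLOWED_OUTPUT_TYPES)}")
    return payload


def run_prediction(audio, prompt=None, **overrides):
    """Hypothetical helper: submit the payload with the Replicate client."""
    import replicate  # third-party; imported lazily so build_input works without it

    return replicate.run(MODEL_REF, input=build_input(audio, prompt, **overrides))
```

For a transcription-only request, something like `run_prediction(audio_url, prompt="Please convert this audio to text", output_type="text")` would be the natural call, where `audio_url` is a placeholder for your own file.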
Input Parameters
- audio (required)
- Input audio file for processing. Can be used for speech-to-text (ASR) or audio-to-audio generation.
- prompt
- Optional text prompt to guide the model. For ASR, use prompts like 'Please convert this audio to text' or '请将音频内容转换为文字' (Chinese).
- text_top_k
- Top-k for text generation. Limits the token selection to the k most likely tokens.
- audio_top_k
- Top-k for audio generation. Limits the token selection to the k most likely tokens.
- output_type
- Type of output to generate: 'audio' for audio only, 'text' for transcription only, or 'both' for both audio and text responses.
- return_json
- Return text results in JSON format instead of as a text file.
- text_temperature
- Temperature for text generation. Lower values (0.0-0.5) make the output more deterministic, which generally helps factual accuracy.
- audio_temperature
- Temperature for audio generation. Higher values (0.8-1.0) increase creativity but may reduce coherence.
- text_repetition_penalty
- Repetition penalty for text. Values > 1.0 discourage repetition in text generation.
- audio_repetition_penalty
- Repetition penalty for audio. Values > 1.0 discourage repetition in audio generation.
- text_repetition_window_size
- Window size for text repetition penalty calculation.
- audio_repetition_window_size
- Window size for audio repetition penalty calculation.
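To make the sampling parameters above concrete, here is a generic sketch of how top-k, temperature, and a windowed repetition penalty are commonly combined when picking the next token. It is an illustration under stated assumptions, not Kimi-Audio's actual sampler: the function names are hypothetical, the penalty rule is the widely used divide-positive, multiply-negative form, and treating a temperature of 0 as greedy selection is an assumption.

```python
import math
import random


def apply_repetition_penalty(logits, history, penalty, window_size):
    """Down-weight tokens that appear in the last `window_size` generated tokens."""
    adjusted = list(logits)
    if penalty == 1.0 or window_size <= 0:
        return adjusted  # a penalty of 1.0 leaves the logits unchanged
    for token in set(history[-window_size:]):
        value = adjusted[token]
        # Common convention: shrink positive logits, push negative ones further down.
        adjusted[token] = value / penalty if value > 0 else value * penalty
    return adjusted


def sample_token(logits, history, top_k, temperature,
                 penalty=1.0, window_size=0, rng=random):
    """Pick the next token id from raw logits (illustrative only)."""
    logits = apply_repetition_penalty(logits, history, penalty, window_size)
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)
    # Keep only the k highest-scoring token ids.
    k = max(1, min(top_k, len(logits)))
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    # Temperature-scaled softmax weights, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Under this sketch, the example settings (`text_temperature` 0, both penalties 1) would reduce text decoding to plain greedy selection, while audio tokens would be sampled from the 10 most likely candidates at temperature 0.8.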
Output Schema
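- json_str
- Json Str. In the example output this holds the model's text response.
- media_path
- Media Path. In the example output this is the URL of the generated audio file (output.wav).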
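A small sketch of reading these two fields from a prediction result. `summarize_output` is a hypothetical helper, and it assumes that either field may be missing or null when `output_type` is not 'both', which the page does not confirm.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse


def summarize_output(raw):
    """Hypothetical helper: extract the text and audio URL from a result."""
    # Accept either a JSON string or an already-decoded mapping.
    result = json.loads(raw) if isinstance(raw, str) else dict(raw)
    text = result.get("json_str")
    media_url = result.get("media_path")
    # Derive a local filename from the URL path, if an audio URL is present.
    filename = PurePosixPath(urlparse(media_url).path).name if media_url else None
    return {"text": text, "media_url": media_url, "filename": filename}
```

Applied to the example output shown earlier, this would yield the text response, the replicate.delivery URL, and the filename `output.wav`.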
Example Execution Logs
Generating tokens: 0%| | 0/1463 [00:00<?, ?it/s]
Generating tokens: 0%| | 4/1463 [00:00<00:45, 32.39it/s]
Generating tokens: 1%| | 8/1463 [00:00<00:43, 33.64it/s]
Generating tokens: 1%| | 12/1463 [00:00<00:42, 34.09it/s]
Generating tokens: 1%| | 16/1463 [00:00<00:42, 34.31it/s]
Generating tokens: 1%| | 20/1463 [00:00<00:41, 34.45it/s]
Generating tokens: 2%| | 24/1463 [00:00<00:41, 34.53it/s]
Generating tokens: 2%| | 28/1463 [00:00<00:41, 34.59it/s]
Generating tokens: 2%| | 32/1463 [00:00<00:41, 34.62it/s]
Generating tokens: 2%| | 36/1463 [00:01<00:41, 34.63it/s]
Generating tokens: 3%| | 40/1463 [00:01<00:41, 34.63it/s]
Generating tokens: 3%| | 44/1463 [00:01<00:40, 34.62it/s]
Generating tokens: 3%| | 48/1463 [00:01<00:40, 34.61it/s]
Generating tokens: 4%| | 52/1463 [00:01<00:40, 34.62it/s]
Generating tokens: 4%| | 56/1463 [00:01<00:40, 34.61it/s]
Generating tokens: 4%| | 60/1463 [00:01<00:40, 34.63it/s]
Generating tokens: 4%| | 64/1463 [00:01<00:40, 34.61it/s]
Generating tokens: 5%| | 68/1463 [00:01<00:40, 34.58it/s]
Generating tokens: 5%| | 72/1463 [00:02<00:40, 34.56it/s]
Generating tokens: 5%| | 76/1463 [00:02<00:40, 34.57it/s]
Generating tokens: 5%| | 80/1463 [00:02<00:40, 34.55it/s]
Generating tokens: 6%| | 84/1463 [00:02<00:39, 34.55it/s]
Generating tokens: 6%| | 88/1463 [00:02<00:39, 34.56it/s]
Generating tokens: 6%| | 92/1463 [00:02<00:39, 34.58it/s]
Generating tokens: 7%| | 96/1463 [00:02<00:39, 34.57it/s]
Generating tokens: 7%| | 100/1463 [00:02<00:39, 34.54it/s]
Generating tokens: 7%| | 104/1463 [00:03<00:39, 34.54it/s]
Generating tokens: 7%| | 108/1463 [00:03<00:39, 34.53it/s]
Generating tokens: 8%| | 112/1463 [00:03<00:39, 34.51it/s]
Generating tokens: 8%| | 116/1463 [00:03<00:39, 34.52it/s]
Generating tokens: 8%| | 120/1463 [00:03<00:38, 34.53it/s]
Generating tokens: 8%| | 124/1463 [00:03<00:38, 34.54it/s]
Generating tokens: 9%| | 128/1463 [00:03<00:38, 34.55it/s]
Generating tokens: 9%| | 132/1463 [00:03<00:38, 34.54it/s]
Generating tokens: 9%| | 132/1463 [00:03<00:38, 34.25it/s]
Written output to /tmp/output/output.wav
>>> output text: I'm just an AI, I don't have feelings or experiences, so I don't have good or bad days. But I'm always here to help you with your questions and tasks!
Written output to /tmp/output/output.txt
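The progress entries in these logs can be read programmatically. The sketch below uses a hypothetical helper, `last_progress`, to pull the final counter out of the log text. It assumes the tqdm-style layout shown above. The reading that 1463 is an upper limit and that generation stopped early at 132 tokens is an inference from this one log, not something the page states.

```python
import re

# Matches entries such as "132/1463 [00:03<00:38, 34.25it/s]".
# The very first entry reports "?it/s" and is deliberately not matched.
PROGRESS = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<[^,]*, ([\d.]+)it/s\]")


def last_progress(log_text):
    """Hypothetical helper: return the final progress reading, or None."""
    matches = PROGRESS.findall(log_text)
    if not matches:
        return None
    done, total, minutes, seconds, rate = matches[-1]
    return {
        "generated": int(done),        # tokens produced so far
        "limit": int(total),           # the bar's total (assumed to be a cap)
        "elapsed_s": int(minutes) * 60 + int(seconds),
        "tokens_per_s": float(rate),
    }
```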
Version Details
- Version ID: 7500b32387695e89da3d09271850319ba027969f0c714dfc226361609ff29f2b
- Version Created: May 1, 2025