zsxkib/kimi-audio-7b-instruct

▶️ 3.7K runs 📅 May 2025 ⚙️ Cog 0.14.7 🔗 GitHub 📄 Paper ⚖️ License

audio-classification audio-to-audio speech-to-text text-to-speech

Performance

Typical run time: 5.5s

Total runs: 3.7K

About

🎧 Kimi-Audio-7B-Instruct combines ASR, audio reasoning, captioning, emotion sensing, and TTS in one universal model 🔊

Example Output

Output

{"json_str":"I'm just an AI, I don't have feelings or experiences, so I don't have good or bad days. But I'm always here to help you with your questions and tasks!","media_path":"https://replicate.delivery/xezq/y6NyEvrmX47HCN7y4HtEAvxNwdcHI7Pk2vwv3l28O0Bg8EKF/output.wav"}

Performance Metrics

5.50s Prediction Time

5.50s Total Time

All Input Parameters

{
  "audio": "https://replicate.delivery/pbxt/MvvLnD9djueaX3u8fCEm2bptKjp6j1DXIRB1MFncD9ZGxCvF/replicate-prediction-r6ns1e52whrga0cphha96k3axm.wav",
  "text_top_k": 5,
  "audio_top_k": 10,
  "output_type": "both",
  "return_json": true,
  "text_temperature": 0,
  "audio_temperature": 0.8,
  "text_repetition_penalty": 1,
  "audio_repetition_penalty": 1,
  "text_repetition_window_size": 16,
  "audio_repetition_window_size": 64
}
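
The input above can be assembled programmatically. The following is a minimal sketch, assuming the values shown in the example are the defaults you want to start from; `build_input` is a hypothetical helper, not part of any client library, and the audio URL is a placeholder.

```python
import json

# Values mirrored from the example input on this page.
DEFAULTS = {
    "text_top_k": 5,
    "audio_top_k": 10,
    "output_type": "both",
    "return_json": True,
    "text_temperature": 0,
    "audio_temperature": 0.8,
    "text_repetition_penalty": 1,
    "audio_repetition_penalty": 1,
    "text_repetition_window_size": 16,
    "audio_repetition_window_size": 64,
}

ALLOWED_OUTPUT_TYPES = {"audio", "text", "both"}


def build_input(audio_url, prompt=None, **overrides):
    """Hypothetical helper: merge caller overrides onto the values above."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"audio": audio_url, **DEFAULTS, **overrides}
    if prompt is not None:
        payload["prompt"] = prompt
    if payload["output_type"] not in ALLOWED_OUTPUT_TYPES:
        raise ValueError("output_type must be 'audio', 'text', or 'both'")
    return payload


payload = build_input(
    "https://example.com/speech.wav",  # placeholder URL, not a real file
    prompt="Please convert this audio to text",
    output_type="text",
)
print(json.dumps(payload, indent=2))
```

With the official Replicate Python client, a dictionary like this would typically be passed as the `input` argument of `replicate.run`; that call is left out here because it needs an API token and network access.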

Input Parameters

audio (required, Type: string): Input audio file for processing. Can be used for speech-to-text (ASR) or audio-to-audio generation.
prompt (Type: string): Optional text prompt to guide the model. For ASR, use prompts like 'Please convert this audio to text' or '请将音频内容转换为文字' (Chinese).
text_top_k (Type: integer, Default: 5): Top-k for text generation. Limits the token selection to the k most likely tokens.
audio_top_k (Type: integer, Default: 10): Top-k for audio generation. Limits the token selection to the k most likely tokens.
output_type (Default: both): Type of output to generate: 'audio' for audio only, 'text' for transcription only, or 'both' for both audio and text responses.
return_json (Type: boolean, Default: true): Return text results in JSON format instead of a text file.
text_temperature (Type: number, Default: 0): Temperature for text generation. Lower values (0.0-0.5) increase factual accuracy.
audio_temperature (Type: number, Default: 0.8): Temperature for audio generation. Higher values (0.8-1.0) increase creativity but may reduce coherence.
text_repetition_penalty (Type: number, Default: 1): Repetition penalty for text. Values > 1.0 discourage repetition in text generation.
audio_repetition_penalty (Type: number, Default: 1): Repetition penalty for audio. Values > 1.0 discourage repetition in audio generation.
text_repetition_window_size (Type: integer, Default: 16): Window size for text repetition penalty calculation.
audio_repetition_window_size (Type: integer, Default: 64): Window size for audio repetition penalty calculation.
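
To make the sampling parameters above concrete, here is a textbook sketch of how top-k, temperature, and a windowed repetition penalty usually interact when choosing one token. It is not taken from the model's code: `sample_token` is a hypothetical function, treating temperature 0 as greedy decoding is an assumption, and the penalty follows one common convention (divide positive scores, multiply negative ones).

```python
import math
import random


def sample_token(logits, history, top_k, temperature,
                 repetition_penalty, window_size, rng=random):
    """Pick a token id from raw logits (illustrative only)."""
    scores = list(logits)

    # Repetition penalty: damp tokens seen in the last `window_size` steps.
    recent = history[-window_size:] if window_size > 0 else []
    for tok in set(recent):
        s = scores[tok]
        scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty

    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:top_k]

    # Assumption: temperature 0 means greedy decoding (take the best token).
    if temperature <= 0:
        return kept[0]

    # Softmax over the kept scores, scaled by temperature.
    scaled = [scores[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.5, -1.0]

# Text-style settings from this page: greedy, no penalty.
greedy = sample_token(logits, [], top_k=5, temperature=0,
                      repetition_penalty=1, window_size=16)

# Same logits, but token 0 was just emitted and is penalised.
penalised = sample_token(logits, [0], top_k=5, temperature=0,
                         repetition_penalty=2.0, window_size=16)
print(greedy, penalised)
```

With a penalty of 1 the history has no effect, which matches the defaults listed above; raising it above 1.0 shifts the choice away from tokens inside the window.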

Output Schema

json_str (Type: string): Text response, returned as a string.
media_path (Type: string, Format: uri): URI of the generated audio file.
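
As an illustration of consuming this schema, the sketch below parses the example output shown earlier on this page and separates the text from the audio link. `split_output` is a hypothetical helper, and it assumes either field may be absent depending on `output_type`, which the page does not confirm.

```python
import json
from urllib.parse import urlparse

# Example output copied from this page.
raw = (
    '{"json_str":"I\'m just an AI, I don\'t have feelings or experiences, '
    'so I don\'t have good or bad days. But I\'m always here to help you '
    'with your questions and tasks!",'
    '"media_path":"https://replicate.delivery/xezq/'
    'y6NyEvrmX47HCN7y4HtEAvxNwdcHI7Pk2vwv3l28O0Bg8EKF/output.wav"}'
)


def split_output(raw):
    """Hypothetical helper: return (text, audio_url) from the output object."""
    data = json.loads(raw) if isinstance(raw, str) else dict(raw)
    text = data.get("json_str")
    audio_url = data.get("media_path")
    # Reject anything that is not a plain web URL before downloading it.
    if audio_url is not None and urlparse(audio_url).scheme not in ("http", "https"):
        raise ValueError(f"unexpected media_path: {audio_url!r}")
    return text, audio_url


text, audio_url = split_output(raw)
print(text)
print(audio_url)
```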

Example Execution Logs

Generating tokens:   0%|          | 0/1463 [00:00<?, ?it/s]
Generating tokens:   9%|▉         | 132/1463 [00:03<00:38, 34.25it/s]
Written output to /tmp/output/output.wav
>>> output text:  I'm just an AI, I don't have feelings or experiences, so I don't have good or bad days. But I'm always here to help you with your questions and tasks!
Written output to /tmp/output/output.txt
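
The final progress line in the log above reports 132 of 1463 steps at 34.25 it/s. A small sketch of the arithmetic, assuming the line follows the usual tqdm layout and that each iteration is one generated token; `summarize` is a hypothetical helper.

```python
import re

# Final progress line copied from the log above.
line = "Generating tokens:   9%|▉         | 132/1463 [00:03<00:38, 34.25it/s]"

# Captures: steps done, step budget, iterations per second.
PROGRESS = re.compile(r"(\d+)/(\d+) \[.*?, *([\d.]+)it/s\]")


def summarize(line):
    """Extract counts and rate from a tqdm-style line and estimate duration."""
    match = PROGRESS.search(line)
    if match is None:
        raise ValueError("not a progress line with a numeric rate")
    done, total, rate = int(match[1]), int(match[2]), float(match[3])
    return {"done": done, "total": total, "rate": rate, "seconds": done / rate}


stats = summarize(line)
print(stats["done"], stats["total"], round(stats["seconds"], 2))
```

On these numbers token generation accounts for about 3.85 s of the 5.50 s prediction time. The loop appears to have stopped well short of the 1463-step budget, presumably because the model finished its response; the log does not state the reason.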

Version Details

Version ID: 7500b32387695e89da3d09271850319ba027969f0c714dfc226361609ff29f2b
Version Created: May 1, 2025
