moonshotai/kimi-k2-instruct 🔢📝 → 📝

⭐ Official ▶️ 31.9K runs 📅 Aug 2025 ⚙️ Cog 0.16.2 ⚖️ License
code-generation question-answering text-generation text-translation

About

Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks, and is meticulously optimized for agentic capabilities.

Example Output

Prompt:

"What is Replicate?"

Output

Replicate is a cloud-based platform that lets you run open-source machine-learning models (and publish your own) with a simple REST API or a few lines of Python/Node code—no GPUs to rent, no CUDA to install, no containers to build.

Key points:

  • Instant inference: One line like
    replicate.run("stability-ai/stable-diffusion:27b93a2413e7f36cd83da412f119a301612b9c89", {"prompt": "a cat in space"})
    returns the generated image in seconds.

  • Huge model zoo: Thousands of community-contributed models—image generation (Stable Diffusion, SDXL), LLMs (Llama-2, Mistral), speech-to-text (Whisper), code generation, upscaling, voice cloning, etc.

  • Pay-per-second billing: You pay only for the milliseconds the model is actually executing on their autoscaling GPU/CPU fleet (A100, T4, L4, etc.).

  • Cog: Their open-source tool lets you package any model into a reproducible Docker container with one cog.yaml file and push it to Replicate so others can run it.

  • Integrations: Works from Python, Node, cURL, or no-code tools like Zapier; can also spin out persistent endpoints (called “deployments”) that autoscale to zero.

In short, Replicate is “serverless GPUs for ML models”—you get the power of large-scale ML without managing infrastructure.
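The `replicate.run` pattern mentioned above can be sketched with the Replicate Python client. This is a sketch under assumptions: it presumes `pip install replicate` and a `REPLICATE_API_TOKEN` environment variable; the model identifier comes from this page, and the network call is left commented so the snippet runs offline. `join_output` is an illustrative helper, not part of the client library.

```python
# Sketch of calling this model via the Replicate Python client.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
input_params = {
    "prompt": "What is Replicate?",
    "temperature": 0.1,
    "max_tokens": 4096,
}

# import replicate
# output = replicate.run("moonshotai/kimi-k2-instruct", input=input_params)

# The output schema is an array of strings, so the chunks
# are concatenated to reconstruct the full response:
def join_output(chunks):
    """Concatenate the array-of-strings output into one response."""
    return "".join(chunks)

print(join_output(["Replicate is ", "a cloud-based platform."]))
```

Joining the chunks is the usual pattern for language models on Replicate, since the output arrives as a list of string fragments rather than one string.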

Performance Metrics

7.40s Prediction Time
7.42s Total Time
All Input Parameters
{
  "top_p": 1,
  "prompt": "What is Replicate?",
  "max_tokens": 4096,
  "temperature": 0.1,
  "presence_penalty": 0,
  "frequency_penalty": 0
}
Input Parameters
top_p · Type: number · Default: 1 · Range: 0 to 1
Top-p (nucleus) sampling
prompt · Type: string · Default:
Prompt
max_tokens · Type: integer · Default: 4096 · Range: 1 to 16384
The maximum number of tokens the model should generate as output.
temperature · Type: number · Default: 0.1 · Range: 0 to 2
The value used to modulate the next token probabilities.
presence_penalty · Type: number · Default: 0 · Range: -2 to 2
Presence penalty
frequency_penalty · Type: number · Default: 0 · Range: -2 to 2
Frequency penalty
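As a sanity check, the documented ranges above can be encoded in a small validator. This is a hypothetical helper for illustration, not part of the Replicate API; the range table simply mirrors the parameter list above.

```python
# Hypothetical client-side validator for the documented parameter ranges
# (illustrative only; not a Replicate API).
RANGES = {
    "top_p": (0, 1),
    "max_tokens": (1, 16384),
    "temperature": (0, 2),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
}

def validate(params):
    """Return (name, value) pairs that fall outside the documented ranges."""
    errors = []
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not (lo <= value <= hi):
                errors.append((name, value))
    return errors

print(validate({"top_p": 1, "temperature": 0.1}))  # within range: []
print(validate({"temperature": 3.5}))              # out of range
```

Checking ranges locally gives a faster failure than waiting for the API to reject an out-of-range value.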
Output Schema

Output

Type: array · Items Type: string

Example Execution Logs
Prompt: What is Replicate?
Input token count: 5
Output token count: 334
TTFT: 0.56s
Tokens per second: 45.15
Total time: 7.40s
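The throughput figure in the logs is consistent with output tokens divided by total time (334 / 7.40 s ≈ 45.1 tokens/s). Whether Replicate computes it exactly this way is an assumption, but the arithmetic can be checked from the other log values:

```python
# Reproduce the logged tokens-per-second figure from the other log values.
output_tokens = 334
total_time_s = 7.40

tokens_per_second = output_tokens / total_time_s
print(f"{tokens_per_second:.2f} tokens/s")  # ≈ 45.14, close to the logged 45.15
```

The small gap from the logged 45.15 is plausibly rounding in the reported time.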
Version Details
Version ID
a507c17930624e91e0fa74cfe3b91d4607e59c0e58f9c535452c21f298f2d6ed
Version Created
August 5, 2025