moonshotai/kimi-k2-instruct 🔢📝 → 📝

⭐ Official ▶️ 31.9K runs 📅 Aug 2025 ⚙️ Cog 0.16.2 ⚖️ License
code-generation question-answering text-generation text-translation

About

Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks, and is meticulously optimized for agentic capabilities.

Example Output

Prompt:

"What is Replicate?"

Output

Replicate is a cloud-based platform that lets you run open-source machine-learning models (and publish your own) with a simple REST API or a few lines of Python/Node code—no GPUs to rent, no CUDA to install, no containers to build.

Key points:

  • Instant inference: One line like
    replicate.run("stability-ai/stable-diffusion:27b93a2413e7f36cd83da412f119a301612b9c89", {"prompt": "a cat in space"})
    returns the generated image in seconds.

  • Huge model zoo: Thousands of community-contributed models—image generation (Stable Diffusion, SDXL), LLMs (Llama-2, Mistral), speech-to-text (Whisper), code generation, upscaling, voice cloning, etc.

  • Pay-per-second billing: You pay only for the milliseconds the model is actually executing on their autoscaling GPU/CPU fleet (A100, T4, L4, etc.).

  • Cog: Their open-source tool lets you package any model into a reproducible Docker container with one cog.yaml file and push it to Replicate so others can run it.

  • Integrations: Works from Python, Node, cURL, or no-code tools like Zapier; can also spin out persistent endpoints (called “deployments”) that autoscale to zero.

In short, Replicate is “serverless GPUs for ML models”—you get the power of large-scale ML without managing infrastructure.
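The `replicate.run` pattern mentioned above can be sketched with the Replicate Python client. This is a sketch under assumptions: it presumes `pip install replicate` and a `REPLICATE_API_TOKEN` environment variable; the model identifier comes from this page, and the network call is left commented so the snippet runs offline. `join_output` is an illustrative helper, not part of the client library.

```python
# Sketch of calling this model via the Replicate Python client.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
input_params = {
    "prompt": "What is Replicate?",
    "temperature": 0.1,
    "max_tokens": 4096,
}

# import replicate
# output = replicate.run("moonshotai/kimi-k2-instruct", input=input_params)

# The output schema is an array of strings, so the chunks
# are concatenated to reconstruct the full response:
def join_output(chunks):
    """Concatenate the array-of-strings output into one response."""
    return "".join(chunks)

print(join_output(["Replicate is ", "a cloud-based platform."]))
```

Joining the chunks is the usual pattern for language models on Replicate, since the output arrives as a list of string fragments rather than one string.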

Performance Metrics

7.40s Prediction Time
7.42s Total Time
All Input Parameters
{
  "top_p": 1,
  "prompt": "What is Replicate?",
  "max_tokens": 4096,
  "temperature": 0.1,
  "presence_penalty": 0,
  "frequency_penalty": 0
}
Input Parameters
top_p · Type: number · Default: 1 · Range: 0 to 1
Top-p (nucleus) sampling
prompt · Type: string · Default:
Prompt
max_tokens · Type: integer · Default: 4096 · Range: 1 to 16384
The maximum number of tokens the model should generate as output.
temperature · Type: number · Default: 0.1 · Range: 0 to 2
The value used to modulate the next token probabilities.
presence_penalty · Type: number · Default: 0 · Range: -2 to 2
Presence penalty
frequency_penalty · Type: number · Default: 0 · Range: -2 to 2
Frequency penalty
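As a sanity check, the documented ranges above can be encoded in a small validator. This is a hypothetical helper for illustration, not part of the Replicate API; the range table simply mirrors the parameter list above.

```python
# Hypothetical client-side validator for the documented parameter ranges
# (illustrative only; not a Replicate API).
RANGES = {
    "top_p": (0, 1),
    "max_tokens": (1, 16384),
    "temperature": (0, 2),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
}

def validate(params):
    """Return (name, value) pairs that fall outside the documented ranges."""
    errors = []
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not (lo <= value <= hi):
                errors.append((name, value))
    return errors

print(validate({"top_p": 1, "temperature": 0.1}))  # within range: []
print(validate({"temperature": 3.5}))              # out of range
```

Checking ranges locally gives a faster failure than waiting for the API to reject an out-of-range value.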
Output Schema

Output

Type: array · Items Type: string

Example Execution Logs
Prompt: What is Replicate?
Input token count: 5
Output token count: 334
TTFT: 0.56s
Tokens per second: 45.15
Total time: 7.40s
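The throughput figure in the logs is consistent with output tokens divided by total time (334 / 7.40 s ≈ 45.1 tokens/s). Whether Replicate computes it exactly this way is an assumption, but the arithmetic can be checked from the other log values:

```python
# Reproduce the logged tokens-per-second figure from the other log values.
output_tokens = 334
total_time_s = 7.40

tokens_per_second = output_tokens / total_time_s
print(f"{tokens_per_second:.2f} tokens/s")  # ≈ 45.14, close to the logged 45.15
```

The small gap from the logged 45.15 is plausibly rounding in the reported time.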
Version Details
Version ID
a507c17930624e91e0fa74cfe3b91d4607e59c0e58f9c535452c21f298f2d6ed
Version Created
August 5, 2025