prunaai/gpt-oss-120b-fast 🔢📝 → 📝

▶️ 353 runs 📅 Feb 2026 ⚙️ Cog 0.16.9 ⚖️ License
code-generation question-answering text-generation

About

This is a version of GPT-OSS-120B that is optimised by Pruna AI.

Example Output

Output

Replicate is a cloud‑based platform that lets developers and teams run, version, and scale open‑source machine‑learning models with a simple API.

Core offering

| What it does | How it works |
| --- | --- |
| Model hosting & inference | You upload (or select from the public library) a model—e.g., Stable Diffusion, Whisper, CLIP, LLaMA, etc.—and Replicate spins up a container that serves the model as an HTTP endpoint. |
| Pay‑as‑you‑go pricing | Billing is based on the actual compute (GPU seconds, CPU seconds, and data transfer) the model consumes, so you only pay for the inference you run. |
| Versioning & reproducibility | Each model version is immutable (identified by a hash). You can pin a specific version in your code, guaranteeing that the same weights and environment are used every time. |
| Scalable infrastructure | Replicate automatically provisions the right hardware (GPU, TPU, or CPU) and scales horizontally to handle spikes in traffic without you managing servers. |
| Marketplace & community | A public “Model Hub” where creators publish ready‑to‑use models. Users can browse, fork, and remix models, and the hub includes usage metrics, licensing info, and example code. |
| Developer‑friendly SDKs | Official Python, JavaScript/Node, and REST client libraries make it easy to call a model from any application. |
| Observability & logging | Built‑in request logs, latency metrics, and error tracking help you monitor production usage. |

Typical use cases

  • Generating images, audio, or text on‑demand (e.g., Stable Diffusion for image generation, Whisper for speech‑to‑text, GPT‑style models for chat).
  • Embedding or feature extraction (e.g., CLIP embeddings for similarity search, sentence‑transformers for semantic search).
  • Fine‑tuning or custom inference pipelines where you need a specific version of a model with custom weights.
  • Rapid prototyping: spin up a model in minutes instead of provisioning your own GPU servers.

Business model

  • API‑based consumption: Developers are charged per inference (GPU‑seconds, CPU‑seconds, and data transfer).
  • Enterprise plans: Offer SLAs, private VPC connectivity, dedicated hardware, and volume discounts for larger teams.
  • Marketplace revenue share: Model creators can earn a portion of the usage fees when others run their published models.

Company background (as of early 2026)

| Item | Details |
| --- | --- |
| Founded | 2020 |
| Founders | Alon Goren (CEO) and Michele Miller (CTO) – both previously built infrastructure for AI at companies like OpenAI and Google. |
| Headquarters | San Francisco, CA (remote‑first workforce) |
| Funding | Series B round of $45 M in 2024 led by Andreessen Horowitz and Sequoia Capital. |
| Key customers | Start‑ups building AI‑powered SaaS products, large enterprises integrating generative AI, research labs needing reproducible inference, and hobbyist developers. |
| Competitive positioning | Competes with services such as AWS SageMaker Inference, Google Vertex AI, Paperspace, and RunPod, but differentiates itself by focusing on open‑source model accessibility, instant versioned endpoints, and a public model marketplace. |

TL;DR

Replicate provides a host‑and‑run platform for open‑source machine‑learning models, exposing each model as a versioned, scalable API endpoint that developers can call and pay for by the compute they actually use. It aims to make it trivial to integrate state‑of‑the‑art AI (image generation, speech‑to‑text, embeddings, LLMs, etc.) into applications without managing any infrastructure.

Performance Metrics

Prediction Time: 3.73s
Total Time: 3.73s
All Input Parameters
{
  "top_p": 1,
  "message": "What does the company Replicate do?",
  "max_tokens": 1024,
  "temperature": 0.1
}
Input Parameters
top_p
Type: number | Default: 0.95 | Range: 0 - 1
Nucleus sampling: only consider tokens with cumulative probability up to this value

message
Type: string | Default: Explain vLLM in one sentence
The user message to send to the model

max_tokens
Type: integer | Default: 2048 | Range: 1 - 16384
Maximum number of tokens to generate

temperature
Type: number | Default: 0.7 | Range: 0 - 2
Sampling temperature (higher = more creative, lower = more deterministic)

system_prompt
Type: string
Optional system prompt to set the model's behavior
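As an illustration, the parameter schema above maps directly onto the input dict you would pass to the official Python client. The sketch below is a hypothetical helper (`build_input` is not part of any Replicate SDK) that assembles a payload and enforces the documented defaults and ranges; the actual `replicate.run()` call is shown commented out because it requires a `REPLICATE_API_TOKEN`.

```python
# Sketch: assemble an input payload for this model, mirroring the
# documented parameter schema (defaults and ranges as listed above).
def build_input(message, max_tokens=2048, temperature=0.7,
                top_p=0.95, system_prompt=None):
    """Build an input dict, validating against the documented ranges."""
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be in [0, 1]")
    if not 1 <= max_tokens <= 16384:
        raise ValueError("max_tokens must be in [1, 16384]")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be in [0, 2]")
    payload = {
        "message": message,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt  # optional per the schema
    return payload

# The example prediction shown above used these values:
payload = build_input("What does the company Replicate do?",
                      max_tokens=1024, temperature=0.1, top_p=1)

# With an API token configured, the call would look like:
# import replicate
# output = replicate.run("prunaai/gpt-oss-120b-fast", input=payload)
```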
Output Schema

Output

Type: array | Items Type: string
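Because the output schema is an array of strings (typically streamed text chunks), callers usually concatenate the items in order to recover the full response. A minimal sketch, using made-up chunks rather than real model output:

```python
# The output schema is an array of strings; a finished prediction is
# reassembled by joining the chunks in order.
chunks = ["Replicate is ", "a cloud-based ", "platform."]  # illustrative only
text = "".join(chunks)
print(text)  # Replicate is a cloud-based platform.
```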

Version Details
Version ID
50d33d50c1300f783ae30ecc6c2395e71a5b1a95d26dd9a27aa03dfdb4a9ad6a
Version Created
March 4, 2026
Run on Replicate →