prunaai/gpt-oss-120b-fast 🔢📝 → 📝

▶️ 353 runs 📅 Feb 2026 ⚙️ Cog 0.16.9 ⚖️ License
code-generation question-answering text-generation

About

This is a version of GPT-OSS-120B that is optimised by Pruna AI.

Example Output

Output

Replicate is a cloud‑based platform that lets developers and teams run, version, and scale open‑source machine‑learning models with a simple API.

Core offering

| What it does | How it works |
| --- | --- |
| Model hosting & inference | You upload (or select from the public library) a model—e.g., Stable Diffusion, Whisper, CLIP, LLaMA, etc.—and Replicate spins up a container that serves the model as an HTTP endpoint. |
| Pay‑as‑you‑go pricing | Billing is based on the actual compute (GPU seconds, CPU seconds, and data transfer) the model consumes, so you only pay for the inference you run. |
| Versioning & reproducibility | Each model version is immutable (identified by a hash). You can pin a specific version in your code, guaranteeing that the same weights and environment are used every time. |
| Scalable infrastructure | Replicate automatically provisions the right hardware (GPU, TPU, or CPU) and scales horizontally to handle spikes in traffic without you managing servers. |
| Marketplace & community | A public “Model Hub” where creators publish ready‑to‑use models. Users can browse, fork, and remix models, and the hub includes usage metrics, licensing info, and example code. |
| Developer‑friendly SDKs | Official Python, JavaScript/Node, and REST client libraries make it easy to call a model from any application. |
| Observability & logging | Built‑in request logs, latency metrics, and error tracking help you monitor production usage. |

Typical use cases

  • Generating images, audio, or text on‑demand (e.g., Stable Diffusion for image generation, Whisper for speech‑to‑text, GPT‑style models for chat).
  • Embedding or feature extraction (e.g., CLIP embeddings for similarity search, sentence‑transformers for semantic search).
  • Fine‑tuning or custom inference pipelines where you need a specific version of a model with custom weights.
  • Rapid prototyping: spin up a model in minutes instead of provisioning your own GPU servers.

Business model

  • API‑based consumption: Developers are charged per inference (GPU‑seconds, CPU‑seconds, and data transfer).
  • Enterprise plans: Offer SLAs, private VPC connectivity, dedicated hardware, and volume discounts for larger teams.
  • Marketplace revenue share: Model creators can earn a portion of the usage fees when others run their published models.

Company background (as of early 2026)

| Item | Details |
| --- | --- |
| Founded | 2020 |
| Founders | Alon Goren (CEO) and Michele Miller (CTO) – both previously built infrastructure for AI at companies like OpenAI and Google. |
| Headquarters | San Francisco, CA (remote‑first workforce) |
| Funding | Series B round of $45 M in 2024 led by Andreessen Horowitz and Sequoia Capital. |
| Key customers | Start‑ups building AI‑powered SaaS products, large enterprises integrating generative AI, research labs needing reproducible inference, and hobbyist developers. |
| Competitive positioning | Competes with services such as AWS SageMaker Inference, Google Vertex AI, Paperspace, and RunPod, but differentiates itself by focusing on open‑source model accessibility, instant versioned endpoints, and a public model marketplace. |

TL;DR

Replicate provides a host‑and‑run platform for open‑source machine‑learning models, exposing each model as a versioned, scalable API endpoint that developers can call and pay for by the compute they actually use. It aims to make it trivial to integrate state‑of‑the‑art AI (image generation, speech‑to‑text, embeddings, LLMs, etc.) into applications without managing any infrastructure.

Performance Metrics

Prediction Time: 3.73s
Total Time: 3.73s
All Input Parameters
{
  "top_p": 1,
  "message": "What does the company Replicate do?",
  "max_tokens": 1024,
  "temperature": 0.1
}
Input Parameters
top_p
Type: number | Default: 0.95 | Range: 0 - 1
Nucleus sampling: only consider tokens with cumulative probability up to this value

message
Type: string | Default: Explain vLLM in one sentence
The user message to send to the model

max_tokens
Type: integer | Default: 2048 | Range: 1 - 16384
Maximum number of tokens to generate

temperature
Type: number | Default: 0.7 | Range: 0 - 2
Sampling temperature (higher = more creative, lower = more deterministic)

system_prompt
Type: string
Optional system prompt to set the model's behavior
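As an illustration, the parameter schema above maps directly onto the input dict you would pass to the official Python client. The sketch below is a hypothetical helper (`build_input` is not part of any Replicate SDK) that assembles a payload and enforces the documented defaults and ranges; the actual `replicate.run()` call is shown commented out because it requires a `REPLICATE_API_TOKEN`.

```python
# Sketch: assemble an input payload for this model, mirroring the
# documented parameter schema (defaults and ranges as listed above).
def build_input(message, max_tokens=2048, temperature=0.7,
                top_p=0.95, system_prompt=None):
    """Build an input dict, validating against the documented ranges."""
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be in [0, 1]")
    if not 1 <= max_tokens <= 16384:
        raise ValueError("max_tokens must be in [1, 16384]")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be in [0, 2]")
    payload = {
        "message": message,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt  # optional per the schema
    return payload

# The example prediction shown above used these values:
payload = build_input("What does the company Replicate do?",
                      max_tokens=1024, temperature=0.1, top_p=1)

# With an API token configured, the call would look like:
# import replicate
# output = replicate.run("prunaai/gpt-oss-120b-fast", input=payload)
```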
Output Schema

Output

Type: array | Items Type: string
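Because the output schema is an array of strings (typically streamed text chunks), callers usually concatenate the items in order to recover the full response. A minimal sketch, using made-up chunks rather than real model output:

```python
# The output schema is an array of strings; a finished prediction is
# reassembled by joining the chunks in order.
chunks = ["Replicate is ", "a cloud-based ", "platform."]  # illustrative only
text = "".join(chunks)
print(text)  # Replicate is a cloud-based platform.
```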

Version Details
Version ID
50d33d50c1300f783ae30ecc6c2395e71a5b1a95d26dd9a27aa03dfdb4a9ad6a
Version Created
March 4, 2026
Run on Replicate →