GPT-4o Transcribe: OpenAI's Audio Transcription Model

⭐ Official ▶️ 50.5K runs 📅 May 2025 ⚙️ Cog 0.16.8 ⚖️ License
speech-to-text transcription audio openai

GPT-4o Transcribe is OpenAI's latest audio transcription model. Run it on Replicate to transcribe audio files without setting up OpenAI API keys or managing GPU infrastructure.

About

GPT-4o Transcribe converts speech to text using OpenAI's multimodal architecture. It handles multiple languages, accents, background noise, and overlapping speakers better than earlier Whisper-based models.

The model produces accurate transcriptions with proper punctuation and formatting, making it practical for meeting notes, podcast transcripts, subtitle generation, and voice-to-text workflows.

Typical use cases

  • Transcribing meetings, interviews, and calls
  • Generating subtitles for video content
  • Converting voice memos to searchable text
  • Processing multilingual audio in a single pipeline

Example Output

So we just added GPT-4o transcribe to Replicate and thought you'd want to know. It's basically a speech-to-text model that uses GPT-4o to turn your audio into text. The cool thing is that it's noticeably better than the Whisper models we've been using, fewer errors, better at recognizing different languages, and just more accurate overall. If you've ever been frustrated with transcripts that mess up technical terms or struggle with different accents, you'll probably appreciate this upgrade. It just works better. Some quick tech specs if you're curious. It has a 16,000 token context window, which means it can handle longer audio clips in one go. And it can output up to 2,000 tokens, so you'll get nice complete transcripts. The model's knowledge is current up to June 2024, so it's pretty up-to-date with language and terminology.

Performance Metrics

Prediction Time: 2.89s
Total Time: 2.90s

All Input Parameters
{
  "language": "en",
  "audio_file": "https://replicate.delivery/xezq/XoxHeakty0z3KKc46cMLPKC2ct54ekT3EtvcwDQuRIuxfJdpA/tmpsglqtqn5.mp3",
  "temperature": 0
}
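
The input above could be sent with Replicate's Python client roughly as follows. This is a minimal sketch rather than an official snippet: the model identifier `openai/gpt-4o-transcribe` is an assumption inferred from the page title, `build_input` and `transcribe` are hypothetical helper names, and the call assumes the `replicate` package is installed and a `REPLICATE_API_TOKEN` environment variable is set.

```python
from typing import Optional

# Assumed model identifier; confirm it on the model page before use.
MODEL = "openai/gpt-4o-transcribe"


def build_input(audio_file: str,
                language: Optional[str] = None,
                prompt: Optional[str] = None,
                temperature: float = 0) -> dict:
    """Assemble the input payload, leaving out optional fields that are unset."""
    payload = {"audio_file": audio_file, "temperature": temperature}
    if language is not None:
        payload["language"] = language
    if prompt is not None:
        payload["prompt"] = prompt
    return payload


def transcribe(audio_file: str, **options):
    """Run the model on Replicate and return its raw output."""
    import replicate  # imported lazily so build_input works without the package

    return replicate.run(MODEL, input=build_input(audio_file, **options))
```

Calling `transcribe(url, language="en")` with the URL of an audio file would reproduce the example request; only `build_input` runs without network access.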
Input Parameters
prompt Type: string
Optional text to guide the model's style or to continue a previous audio segment. The prompt should be in the same language as the audio.
language Type: string
The language of the input audio. Supplying it in ISO 639-1 format (e.g. en) improves accuracy and latency.
audio_file (required) Type: string
The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
temperature Type: number, Default: 0, Range: 0 to 1
Sampling temperature, between 0 and 1.
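
The constraints listed above can be checked before a request is sent. The sketch below is illustrative only: `validate_input` is a hypothetical helper, the format check relies on the file extension in the URL path (a heuristic, since the service may inspect the file itself), and the language check verifies only the two-letter shape of the code, not that it is a registered ISO 639-1 code.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def validate_input(params: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []

    audio_file = params.get("audio_file")
    if not audio_file:
        problems.append("audio_file is required")
    else:
        # Judge the format by the extension of the URL path (heuristic).
        suffix = PurePosixPath(urlparse(audio_file).path).suffix.lstrip(".").lower()
        if suffix not in SUPPORTED_FORMATS:
            problems.append(f"unsupported audio format: {suffix or 'unknown'}")

    language = params.get("language")
    if language is not None and not re.fullmatch(r"[a-z]{2}", language):
        # Shape check only: two lowercase letters.
        problems.append("language should be a two-letter ISO 639-1 code such as 'en'")

    temperature = params.get("temperature", 0)
    is_number = isinstance(temperature, (int, float)) and not isinstance(temperature, bool)
    if not is_number or not 0 <= temperature <= 1:
        problems.append("temperature must be a number between 0 and 1")

    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a payload at once.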
Output Schema

Output

Type: array, Items Type: string
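
Because the output is an array of strings, a caller will usually want to collapse it into a single transcript. A minimal sketch, assuming the items are consecutive text fragments that carry their own spacing (the page does not say how the transcript is split); `join_transcript` is a hypothetical helper name.

```python
from typing import Iterable, Union


def join_transcript(output: Union[str, Iterable[str]]) -> str:
    """Collapse the model output into one transcript string."""
    if isinstance(output, str):
        return output.strip()
    # Assumption: fragments are in order and include their own whitespace.
    return "".join(output).strip()
```

Accepting any iterable keeps the helper usable whether the client returns a list or yields fragments as they arrive.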

Example Execution Logs
Input audio duration: 54.756 seconds
Input token count: 912
Output token count: 174
Total token count: 1086
TTFT: 1.38s
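
The log lines above follow a simple "label: value" shape, so they can be parsed for monitoring. The sketch below assumes that format stays stable, which the page does not guarantee; `parse_logs` and the derived variable names are hypothetical.

```python
import re

# Matches "<label>: <number><optional unit>", e.g. "TTFT: 1.38s".
LOG_LINE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]*)$"
)


def parse_logs(text: str) -> dict:
    """Map each log label to its numeric value, ignoring units and unmatched lines."""
    metrics = {}
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            metrics[match.group("label").strip()] = float(match.group("value"))
    return metrics


logs = """Input audio duration: 54.756 seconds
Input token count: 912
Output token count: 174
Total token count: 1086
TTFT: 1.38s"""

m = parse_logs(logs)
# Sanity check: input and output tokens should add up to the reported total.
tokens_consistent = (
    m["Input token count"] + m["Output token count"] == m["Total token count"]
)
# Input tokens consumed per second of audio.
input_tokens_per_audio_second = m["Input token count"] / m["Input audio duration"]
```

For this run the counts are consistent (912 + 174 = 1086), and the input works out to roughly 16.7 tokens per second of audio.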
Version Details
Version ID: cc7638666fc85e9defb010d99e304c0c0e94dcdbd3d31385f28f2730b4cdcc6d
Version Created: November 7, 2025