openai/gpt-4o-transcribe 📝🖼️🔢 → 📝
About
A speech-to-text model that uses GPT-4o to transcribe audio

Example Output
Output
So we just added GPT-4o transcribe to Replicate and thought you'd want to know. It's basically a speech-to-text model that uses GPT-4o to turn your audio into text. The cool thing is that it's noticeably better than the Whisper models we've been using, fewer errors, better at recognizing different languages, and just more accurate overall. If you've ever been frustrated with transcripts that mess up technical terms or struggle with different accents, you'll probably appreciate this upgrade. It just works better. Some quick tech specs if you're curious. It has a 16,000 token context window, which means it can handle longer audio clips in one go. And it can output up to 2,000 tokens, so you'll get nice complete transcripts. The model's knowledge is current up to June 2024, so it's pretty up-to-date with language and terminology.
Performance Metrics
- Prediction Time: 2.89s
- Total Time: 2.90s
All Input Parameters
{
  "language": "en",
  "audio_file": "https://replicate.delivery/xezq/XoxHeakty0z3KKc46cMLPKC2ct54ekT3EtvcwDQuRIuxfJdpA/tmpsglqtqn5.mp3",
  "temperature": 0
}
Input Parameters
- prompt: An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- language: The language of the input audio. Supplying the input language in ISO-639-1 format (e.g. en) will improve accuracy and latency.
- audio_file (required): The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
- temperature: Sampling temperature between 0 and 1.
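The parameters above can be assembled into an input payload before a prediction is requested. The following is a minimal sketch, not an official client: `build_transcribe_input` and `transcribe` are hypothetical helper names, and the client-side checks on file extension and temperature range are assumptions derived from the parameter descriptions (the service may validate differently). The final call assumes the third-party `replicate` Python package is installed and a `REPLICATE_API_TOKEN` is set; the shape of the returned value is not documented on this page.

```python
from typing import Any, Dict, Optional

# Formats listed under audio_file in the parameter descriptions.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def build_transcribe_input(
    audio_file: str,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Assemble an input payload (hypothetical helper, not part of any SDK)."""
    # Assumption: the format can be inferred from the file extension,
    # ignoring any query string on a URL.
    path = audio_file.split("?", 1)[0]
    extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {extension!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")

    # Only audio_file is required; optional fields are sent when provided.
    payload: Dict[str, Any] = {"audio_file": audio_file}
    if language is not None:
        payload["language"] = language
    if prompt is not None:
        payload["prompt"] = prompt
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


def transcribe(payload: Dict[str, Any]) -> Any:
    """Send the payload to the model (requires network access and an API token)."""
    import replicate  # third-party client, imported lazily

    return replicate.run("openai/gpt-4o-transcribe", input=payload)


example_input = build_transcribe_input(
    "https://replicate.delivery/xezq/XoxHeakty0z3KKc46cMLPKC2ct54ekT3EtvcwDQuRIuxfJdpA/tmpsglqtqn5.mp3",
    language="en",
    temperature=0,
)
```

Here `example_input` reproduces the values shown under All Input Parameters. Passing it to `transcribe` would start a real prediction, so that step is left to the reader.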
Output Schema
Output
Example Execution Logs
Input audio duration: 54.756 seconds
Input token count: 912
Output token count: 174
Total token count: 1086
TTFT: 1.38s
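The figures in this log are internally consistent: 912 input tokens plus 174 output tokens gives the reported total of 1086. A short sketch of how such a log could be parsed and checked is shown here. The field labels are taken from this single example, so the exact log format is an assumption, and `parse_log` is a hypothetical helper.

```python
import re
from typing import Dict

EXAMPLE_LOG = (
    "Input audio duration: 54.756 seconds Input token count: 912 "
    "Output token count: 174 Total token count: 1086 TTFT: 1.38s"
)


def parse_log(text: str) -> Dict[str, float]:
    """Extract the numeric fields from a log shaped like the example above."""
    patterns = {
        "duration_s": r"Input audio duration:\s*([\d.]+)\s*seconds",
        "input_tokens": r"Input token count:\s*(\d+)",
        "output_tokens": r"Output token count:\s*(\d+)",
        "total_tokens": r"Total token count:\s*(\d+)",
        "ttft_s": r"TTFT:\s*([\d.]+)s",
    }
    fields: Dict[str, float] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"missing field in log: {name}")
        fields[name] = float(match.group(1))
    return fields


stats = parse_log(EXAMPLE_LOG)

# Input tokens consumed per second of audio in this one example.
input_tokens_per_second = stats["input_tokens"] / stats["duration_s"]
```

For this clip the ratio works out to roughly 16.7 input tokens per second of audio. That is a single data point, not a documented rate.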
Version Details
- Version ID: cf92fe5e0d9a451f2c47c58883af0ff92e3908c138239d8cba7f8646e99657bc
- Version Created: October 13, 2025