victor-upmeet/whisperx-a40-large

2.7M runs · Nov 2023 · Cog 0.19.3
Tags: speaker-diarization, speech-to-text

About

Accelerated transcription, word-level timestamps and diarization with whisperX large-v3 for large audio files

Example Output


{
  "segments": [
    {
      "end": 30.811,
      "text": " The little tales they tell are false. The door was barred, locked and bolted as well. Ripe pears are fit for a queen's table. A big wet stain was on the round carpet. The kite dipped and swayed but stayed aloft. The pleasant hours fly by much too soon. The room was crowded with a mild wob.",
      "start": 2.585
    },
    {
      "end": 48.592,
      "text": " The room was crowded with a wild mob. This strong arm shall shield your honor. She blushed when he gave her a white orchid. The beetle droned in the hot June sun.",
      "start": 33.029
    }
  ],
  "detected_language": "en"
}
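A minimal Python sketch of consuming an output like the one above. It assumes the shape shown in this example (a `segments` list whose items carry `start`, `end` and `text`, plus a `detected_language` string) and computes each segment's duration. `Segment` and `parse_output` are hypothetical helpers, not part of the model or of any client library, and the text fields are shortened.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def parse_output(payload: str) -> tuple[list[Segment], str]:
    """Parse a prediction output shaped like the example above (hypothetical helper)."""
    data = json.loads(payload)
    segments = [
        Segment(start=s["start"], end=s["end"], text=s["text"].strip())
        for s in data["segments"]
    ]
    return segments, data["detected_language"]


# Timestamps copied from the example output; text fields shortened with "...".
example = json.dumps({
    "segments": [
        {"end": 30.811, "text": " The little tales they tell are false. ...", "start": 2.585},
        {"end": 48.592, "text": " The room was crowded with a wild mob. ...", "start": 33.029},
    ],
    "detected_language": "en",
})

segments, language = parse_output(example)
for seg in segments:
    print(f"[{seg.start:7.3f} -> {seg.end:7.3f}] {seg.end - seg.start:6.3f}s  {seg.text}")

# Time covered by segments only (the gap between segments is excluded).
speech = sum(seg.end - seg.start for seg in segments)
print(f"language={language}, speech in segments: {speech:.3f}s")
```

Note that the leading space Whisper puts at the start of each `text` field is stripped here; keep it if you concatenate segments back into one transcript.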

Performance Metrics

8.00s Prediction Time
139.60s Total Time
All Input Parameters
{
  "debug": false,
  "vad_onset": 0.5,
  "audio_file": "https://replicate.delivery/pbxt/JrckTmbaACSq83MQ5IW8E85b2NPUWZYpCyvxD7A836I5j21G/OSR_uk_000_0050_8k.wav",
  "batch_size": 64,
  "vad_offset": 0.363,
  "diarization": false,
  "temperature": 0,
  "align_output": false
}
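Every value in the input above equals its documented default except `audio_file`. As a hedged Python sketch, a client could send only the non-default values. `DEFAULTS` is copied from the Input Parameters list on this page, and `build_input` is a hypothetical helper, not part of any library.

```python
import json

# Defaults as listed under "Input Parameters" on this page.
DEFAULTS = {
    "task": "transcribe",
    "debug": False,
    "language": None,
    "vad_onset": 0.5,
    "batch_size": 64,
    "user_agent": None,
    "vad_offset": 0.363,
    "diarization": False,
    "temperature": 0,
    "align_output": False,
    "max_speakers": None,
    "min_speakers": None,
    "initial_prompt": None,
    "huggingface_access_token": None,
    "language_detection_min_prob": 0,
    "language_detection_max_tries": 5,
}


def build_input(audio_file: str, **overrides) -> dict:
    """Hypothetical helper: keep audio_file plus only the overrides that differ from the defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"audio_file": audio_file}
    payload.update({k: v for k, v in overrides.items() if v != DEFAULTS[k]})
    return payload


# Reproduces the example run: everything else is left at its default.
payload = build_input(
    "https://replicate.delivery/pbxt/JrckTmbaACSq83MQ5IW8E85b2NPUWZYpCyvxD7A836I5j21G/OSR_uk_000_0050_8k.wav"
)
print(json.dumps(payload, indent=2))
```

The resulting dict could be passed as `input=` to the Replicate Python client, e.g. `replicate.run("victor-upmeet/whisperx-a40-large:<version id>", input=payload)` with the version ID listed under Version Details. That call needs network access and an API token, so it is not executed here.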
Input Parameters
task (Default: transcribe)
Task to perform on the audio file. Options: transcribe, or translate (translation into English only).
debug (Type: boolean, Default: false)
Print compute/inference times and memory usage information.
language (Type: string, Default: null)
ISO code of the language spoken in the audio. Set to None (null) to detect the language automatically.
vad_onset (Type: number, Default: 0.5)
Voice activity detection (VAD) onset threshold: a speech region starts when the VAD score rises above this value.
audio_file (required, Type: string)
Audio file to process (a URL in the example above).
batch_size (Type: integer, Default: 64)
Number of audio chunks transcribed in parallel (batched inference).
user_agent (Type: string, Default: null)
Override the User-Agent used to download the audio file. Useful when the host blocks the default value.
vad_offset (Type: number, Default: 0.363)
VAD offset threshold: an active speech region ends when the VAD score falls below this value.
diarization (Type: boolean, Default: false)
Assign speaker ID labels.
temperature (Type: number, Default: 0)
Temperature to use for sampling.
align_output (Type: boolean, Default: false)
Align the Whisper output to obtain accurate word-level timestamps.
max_speakers (Type: integer, Default: null)
Maximum number of speakers when diarization is enabled (leave blank if unknown).
min_speakers (Type: integer, Default: null)
Minimum number of speakers when diarization is enabled (leave blank if unknown).
initial_prompt (Type: string, Default: null)
Optional text to provide as a prompt for the first window.
huggingface_access_token (Type: string, Default: null)
Hugging Face access token (read) required to enable diarization. You must also accept the user agreements for the models listed in the README.
language_detection_min_prob (Type: number, Default: 0)
If language is not specified, the language is detected repeatedly on different parts of the file until a detection reaches this probability.
language_detection_max_tries (Type: integer, Default: 5)
If language is not specified, detection follows the logic of language_detection_min_prob but stops after this many tries. If the limit is reached, the most probable language is kept.
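Several of these parameters only make sense together. The following is a hypothetical client-side check, written as a Python sketch from the descriptions above. It assumes that diarization needs `huggingface_access_token`, that `min_speakers`/`max_speakers` only matter when `diarization` is true, and that `language_detection_min_prob` is a probability in [0, 1]. The model itself may enforce these rules differently, or not at all.

```python
from __future__ import annotations

from typing import Any


def check_input(payload: dict[str, Any]) -> list[str]:
    """Hypothetical pre-flight checks; an empty list means no problem was found."""
    problems = []

    task = payload.get("task", "transcribe")
    if task not in ("transcribe", "translate"):
        problems.append(f"task must be 'transcribe' or 'translate', got {task!r}")

    diarize = payload.get("diarization", False)
    if diarize and not payload.get("huggingface_access_token"):
        problems.append("diarization requires huggingface_access_token")

    lo, hi = payload.get("min_speakers"), payload.get("max_speakers")
    if not diarize and (lo is not None or hi is not None):
        problems.append("min_speakers/max_speakers only apply when diarization is true")
    if lo is not None and hi is not None and lo > hi:
        problems.append(f"min_speakers ({lo}) exceeds max_speakers ({hi})")

    # Assumption: the threshold is a probability, so values outside [0, 1] are suspicious.
    p = payload.get("language_detection_min_prob", 0)
    if not 0 <= p <= 1:
        problems.append("language_detection_min_prob should be in [0, 1]")

    return problems


print(check_input({"audio_file": "a.wav", "diarization": True}))
```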
Output Schema
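segments (Type: object)
Transcribed segments. In the example output this is a list of objects, each with start and end times in seconds and the transcribed text.
detected_language (Type: string)
Language code detected in the audio (for example, en).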
Example Execution Logs
No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.0.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.0+cu121. Bad things might happen unless you revert torch to 1.x.
Detected language: en (1.00) in first 30s of audio...
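The log above reports a single detection on the first 30 s of audio, which is what the default language_detection_min_prob of 0 implies: the first attempt always qualifies. The retry logic described for language_detection_min_prob and language_detection_max_tries can be sketched in Python as follows. This sketch assumes the "different parts of the file" are consecutive windows and that `detect` is a hypothetical callback returning `(language, probability)`. It is not WhisperX's actual implementation.

```python
from __future__ import annotations

from typing import Callable, Sequence

Detection = tuple[str, float]  # (language code, probability)


def detect_language(
    windows: Sequence[object],
    detect: Callable[[object], Detection],
    min_prob: float = 0.0,
    max_tries: int = 5,
) -> Detection:
    """Try successive windows until one reaches min_prob; otherwise keep the most probable."""
    best: Detection | None = None
    for window in windows[:max_tries]:
        lang, prob = detect(window)
        if best is None or prob > best[1]:
            best = (lang, prob)
        if prob >= min_prob:
            break  # threshold reached, stop early
    if best is None:
        raise ValueError("no audio windows to run language detection on")
    return best


# Fake detector standing in for a real model, keyed by window name.
fake = {"w1": ("de", 0.4), "w2": ("en", 0.6), "w3": ("en", 0.95)}
print(detect_language(["w1", "w2", "w3"], fake.__getitem__, min_prob=0.9))
```

Because every earlier window must have fallen short of `min_prob` for the loop to continue, the window that finally passes the threshold is also the most probable one seen so far.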
Version Details
Version ID: 8aad2534a4f2a268a80ab781928cf4bc624b0bbed25afe4d789c70c5781c47b1
Version created: May 13, 2026