romanfurman6/whisperx-multi-chunk

10 runs · Jul 2025 · Cog 0.15.9 · GitHub
speaker-diarization speech-to-text

About

A WhisperX variant that works on audio split into multiple chunks: it downloads each chunk, transcribes it, and merges the per-chunk results into a single transcript with timestamps adjusted to the full recording.

Example Output

Output

{"segments":[{"end":23.858,"text":" Welcome to an amazing journey through the fascinating world of cats. Whether you're a devoted cat parent, thinking about getting a feline friend, or just curious about these incredible creatures, you're about to discover some absolutely mind-blowing facts that will make you appreciate cats in a whole new way. From their incredible superpowers to their ancient history, from their unique biology to their mysterious behaviors,","start":0.031,"chunk_index":0},{"end":30.052,"text":" We're going to explore everything that makes cats so special. So settle in and get ready to be amazed by the wonderful","start":23.858,"chunk_index":0},{"end":618.302,"text":" world of whiskers, purrs, and everything cats. Let's start with some truly incredible cat superpowers that might surprise you. Did you know that cats can rotate their ears a full 180 degrees? That's right. They have an amazing 32 muscles controlling each ear, compared to our measly six muscles.","start":601.1229999999999,"chunk_index":1},{"end":631.144,"text":" This gives them incredibly precise hearing that's about five times more sensitive than human hearing. They can detect sounds up to 64,000 Hz, which is almost two octaves higher than what humans can hear. This means your cat can hear the ultrasonic","start":618.302,"chunk_index":1},{"end":1232.151,"text":" calls of mice and other small prey that are completely silent to us but their hearing isn't the only superpower cats possess here's something that might surprise you cats can't taste sweetness at all they're completely missing the taste receptors for sugar which explains why they're not interested in your ice cream or candy however they make up for this with an incredible sense of smell that's about 14 times stronger than ours they have approximately 200 million scent receptors compared to our 5 million cats also have a special order","start":1202.215,"chunk_index":2}],"total_chunks":3,"processing_time":18.168604135513306,"detected_language":"en"}

Performance Metrics

18.17s Prediction Time
76.01s Total Time
All Input Parameters
{
  "debug": true,
  "language": "en",
  "vad_onset": 0.5,
  "audio_urls": [
    "https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_1.flac",
    "https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_2.flac",
    "https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_3.flac"
  ],
  "batch_size": 32,
  "vad_offset": 0.363,
  "diarization": false,
  "temperature": 0.2,
  "align_output": false,
  "chunk_size_seconds": 30,
  "total_duration_seconds": 90,
  "language_detection_min_prob": 0.7,
  "language_detection_max_tries": 5
}
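
For reference, a request like the one above can be submitted with the Replicate Python client; this is a minimal sketch, not part of this page's official documentation. It assumes `pip install replicate` and a REPLICATE_API_TOKEN in the environment, uses the version hash listed under Version Details below, and the example.com URLs are placeholders for your own publicly hosted chunk files.

# Minimal sketch: running the model with the Replicate Python client.
import replicate

output = replicate.run(
    "romanfurman6/whisperx-multi-chunk:d1ad611913cc51721dfeb9499c71d5866db3e6263619a6477782df67cde040d5",
    input={
        "audio_urls": [
            "https://example.com/chunks/audio_chunk_1.flac",  # placeholder URLs
            "https://example.com/chunks/audio_chunk_2.flac",
            "https://example.com/chunks/audio_chunk_3.flac",
        ],
        "chunk_size_seconds": 30,
        "total_duration_seconds": 90,
        "language": "en",
        "batch_size": 32,
        "diarization": False,
        "align_output": False,
    },
)

for segment in output["segments"]:
    print(f"[{segment['start']:8.2f}s - {segment['end']:8.2f}s] {segment['text'].strip()}")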
Input Parameters
debug Type: boolean, Default: true
Print debug information
language Type: string
ISO code of the language spoken in the audio; specify None to perform language detection
vad_onset Type: number, Default: 0.5
VAD onset threshold
audio_urls (required) Type: array
Array of public audio URLs to process, one per chunk
batch_size Type: integer, Default: 32
Batch size used to parallelize transcription of the input audio
vad_offset Type: number, Default: 0.363
VAD offset threshold
diarization Type: boolean, Default: false
Whether to perform diarization
temperature Type: number, Default: 0.2
Temperature to use for sampling
align_output Type: boolean, Default: false
Whether to align output for word-level timestamps
max_speakers Type: integer
Maximum number of speakers if diarization is activated
min_speakers Type: integer
Minimum number of speakers if diarization is activated
initial_prompt Type: string
Optional text prompt for the first window
chunk_size_seconds (required) Type: number
Duration of each chunk in seconds, used for timestamp calculation. The last chunk can be shorter; its duration is derived from the total duration and the number of chunks (see the sketch after this list).
total_duration_seconds (required) Type: number
Total duration of the complete audio in seconds
huggingface_access_token Type: string
HuggingFace token for diarization
language_detection_min_prob Type: number, Default: 0.7
Minimum probability for recursive language detection
language_detection_max_tries Type: integer, Default: 5
Maximum retries for recursive language detection
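
To make the chunk_size_seconds / total_duration_seconds relationship concrete, here is a tiny sketch of the documented rule that every chunk is chunk_size_seconds long except possibly the last one. This only illustrates the parameter descriptions above, not the model's actual implementation.

# Every chunk is chunk_size_seconds long except possibly the last one,
# whose expected duration follows from the total duration and chunk count.
def expected_chunk_durations(num_chunks: int,
                             chunk_size_seconds: float,
                             total_duration_seconds: float) -> list[float]:
    last = total_duration_seconds - (num_chunks - 1) * chunk_size_seconds
    return [chunk_size_seconds] * (num_chunks - 1) + [last]

# With the example request (3 chunks, 30 s each, 90 s total):
print(expected_chunk_durations(3, 30, 90))  # [30, 30, 30]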
Output Schema
segments
Segments
total_chunks Type: integer
Total Chunks
processing_time Type: number
Processing Time
detected_language Type: string
Detected Language
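
Because every segment carries start, end, and text, the merged output is easy to post-process. The helper below is a hypothetical sketch that turns the returned segments into an SRT subtitle file, relying only on the output fields documented above.

# Hypothetical helper: convert the returned segments into SRT subtitles.
def to_srt(segments: list[dict]) -> str:
    def stamp(seconds: float) -> str:
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# Usage with the prediction output shown above:
# open("transcript.srt", "w").write(to_srt(output["segments"]))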
Example Execution Logs
Processing 3 audio URLs
Total duration: 90.00 seconds
Expected chunk duration: 30.00 seconds
Chunk size parameter: 30.00 seconds
Downloading chunk 1: https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_1.flac
Downloaded chunk 1 to /tmp/tmpr8gh3eyc.flac
Downloading chunk 2: https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_2.flac
Processing chunk 1 with offset 0.00s
Downloaded chunk 2 to /tmp/tmp298m71iy.flac
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu121. Bad things might happen unless you revert torch to 1.x.
/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/pyannote/audio/utils/reproducibility.py:74: ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled as it might lead to reproducibility issues and lower accuracy.
It can be re-enabled by calling
>>> import torch
>>> torch.backends.cuda.matmul.allow_tf32 = True
>>> torch.backends.cudnn.allow_tf32 = True
See https://github.com/pyannote/pyannote-audio/issues/1370 for more details.
warnings.warn(
Completed chunk 1
Downloading chunk 3: https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_3.flac
Processing chunk 2 with offset 601.09s
Downloaded chunk 3 to /tmp/tmp_2psrj2v.flac
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu121. Bad things might happen unless you revert torch to 1.x.
Completed chunk 2
Processing chunk 3 with offset 1202.18s
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu121. Bad things might happen unless you revert torch to 1.x.
Completed chunk 3
Merged 5 segments from 3 chunks
Total transcription duration: 1232.15 seconds
First segment: 0.03s - 23.86s
Last segment: 1202.21s - 1232.15s
Total processing time: 18.17 seconds
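
The logs suggest that each chunk is downloaded and transcribed independently, and that each chunk's local timestamps are shifted by a per-chunk offset before the segments are merged. Below is a rough sketch of that merge step, as an illustration of what the logs describe rather than the model's actual code.

# Rough sketch of the merge step suggested by the logs: each chunk's segments
# are shifted by that chunk's offset into the full recording, tagged with
# chunk_index, and concatenated in order.
def merge_chunk_results(chunk_results: list[list[dict]],
                        offsets: list[float]) -> list[dict]:
    merged = []
    for chunk_index, (segments, offset) in enumerate(zip(chunk_results, offsets)):
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
                "chunk_index": chunk_index,
            })
    return merged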
Version Details
Version ID
d1ad611913cc51721dfeb9499c71d5866db3e6263619a6477782df67cde040d5
Version Created
July 14, 2025