romanfurman6/whisperx-multi-chunk
About
WhisperX that works with multiple audio chunks: it downloads each chunk, transcribes it, and merges the per-chunk results into a single transcript.
Example Output
{
  "segments": [
    {
      "end": 23.858,
      "text": " Welcome to an amazing journey through the fascinating world of cats. Whether you're a devoted cat parent, thinking about getting a feline friend, or just curious about these incredible creatures, you're about to discover some absolutely mind-blowing facts that will make you appreciate cats in a whole new way. From their incredible superpowers to their ancient history, from their unique biology to their mysterious behaviors,",
      "start": 0.031,
      "chunk_index": 0
    },
    {
      "end": 30.052,
      "text": " We're going to explore everything that makes cats so special. So settle in and get ready to be amazed by the wonderful",
      "start": 23.858,
      "chunk_index": 0
    },
    {
      "end": 618.302,
      "text": " world of whiskers, purrs, and everything cats. Let's start with some truly incredible cat superpowers that might surprise you. Did you know that cats can rotate their ears a full 180 degrees? That's right. They have an amazing 32 muscles controlling each ear, compared to our measly six muscles.",
      "start": 601.1229999999999,
      "chunk_index": 1
    },
    {
      "end": 631.144,
      "text": " This gives them incredibly precise hearing that's about five times more sensitive than human hearing. They can detect sounds up to 64,000 Hz, which is almost two octaves higher than what humans can hear. This means your cat can hear the ultrasonic",
      "start": 618.302,
      "chunk_index": 1
    },
    {
      "end": 1232.151,
      "text": " calls of mice and other small prey that are completely silent to us but their hearing isn't the only superpower cats possess here's something that might surprise you cats can't taste sweetness at all they're completely missing the taste receptors for sugar which explains why they're not interested in your ice cream or candy however they make up for this with an incredible sense of smell that's about 14 times stronger than ours they have approximately 200 million scent receptors compared to our 5 million cats also have a special order",
      "start": 1202.215,
      "chunk_index": 2
    }
  ],
  "total_chunks": 3,
  "processing_time": 18.168604135513306,
  "detected_language": "en"
}
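The merged output is plain JSON and easy to post-process. A minimal sketch (the `result` dict below is an abridged, hypothetical stand-in for the example output above) that joins segment texts into one transcript and groups segments by their source chunk:

```python
from collections import defaultdict

# Abridged stand-in for the example output above.
result = {
    "segments": [
        {"start": 0.031, "end": 23.858, "text": " Welcome to an amazing journey", "chunk_index": 0},
        {"start": 23.858, "end": 30.052, "text": " through the world of cats.", "chunk_index": 0},
        {"start": 601.123, "end": 618.302, "text": " Cat superpowers.", "chunk_index": 1},
    ],
    "total_chunks": 3,
    "detected_language": "en",
}

# Full transcript: concatenate segment texts in timestamp order.
transcript = "".join(
    s["text"] for s in sorted(result["segments"], key=lambda s: s["start"])
).strip()

# Group segments by the chunk they came from.
by_chunk = defaultdict(list)
for seg in result["segments"]:
    by_chunk[seg["chunk_index"]].append(seg)

print(transcript)
print({k: len(v) for k, v in by_chunk.items()})
```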
Performance Metrics
- Prediction Time: 18.17s
- Total Time: 76.01s
All Input Parameters
{
  "debug": true,
  "language": "en",
  "vad_onset": 0.5,
  "audio_urls": [
    "https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_1.flac",
    "https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_2.flac",
    "https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_3.flac"
  ],
  "batch_size": 32,
  "vad_offset": 0.363,
  "diarization": false,
  "temperature": 0.2,
  "align_output": false,
  "chunk_size_seconds": 30,
  "total_duration_seconds": 90,
  "language_detection_min_prob": 0.7,
  "language_detection_max_tries": 5
}
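A request with these parameters might be assembled in code roughly as follows. This is a hedged sketch: the URLs are placeholders, and the commented-out call assumes the Replicate Python client (the exact model version identifier would need to be filled in).

```python
# Assemble an input payload mirroring the parameters above.
# The chunk URLs here are hypothetical placeholders.
chunk_urls = [
    f"https://example.com/chunks/audio_chunk_{i}.flac" for i in range(1, 4)
]

inputs = {
    "audio_urls": chunk_urls,          # required
    "chunk_size_seconds": 30,          # required
    "total_duration_seconds": 90,      # required
    "language": "en",
    "batch_size": 32,
    "temperature": 0.2,
    "vad_onset": 0.5,
    "vad_offset": 0.363,
    "align_output": False,
    "diarization": False,
    "debug": True,
}

# With the Replicate Python client, the call would look roughly like:
# import replicate
# output = replicate.run("romanfurman6/whisperx-multi-chunk:<version-id>", input=inputs)

# Sanity-check that the required fields are present before sending.
required = {"audio_urls", "chunk_size_seconds", "total_duration_seconds"}
missing = required - inputs.keys()
print(missing)
```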
Input Parameters
- debug: Print debug information
- language: ISO code of the language spoken in the audio; set to None to perform language detection
- vad_onset: VAD onset threshold
- audio_urls (required): Array of public audio URLs to process
- batch_size: Parallelization of input audio transcription
- vad_offset: VAD offset threshold
- diarization: Whether to perform diarization
- temperature: Temperature to use for sampling
- align_output: Whether to align output for word-level timestamps
- max_speakers: Maximum number of speakers if diarization is activated
- min_speakers: Minimum number of speakers if diarization is activated
- initial_prompt: Optional text prompt for the first window
- chunk_size_seconds (required): Duration of each chunk in seconds, used for timestamp calculation. The last chunk can be shorter; its duration is derived from the total duration and the number of chunks.
- total_duration_seconds (required): Total duration of the complete audio in seconds
- huggingface_access_token: HuggingFace token for diarization
- language_detection_min_prob: Minimum probability for recursive language detection
- language_detection_max_tries: Maximum retries for recursive language detection
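The relationship between `chunk_size_seconds` and `total_duration_seconds` described above can be sketched as follows. This is an illustrative assumption, not the model's actual code: it assumes each chunk's offset is simply its index times `chunk_size_seconds`, with the last chunk's duration derived from the total.

```python
def chunk_layout(total_duration: float, chunk_size: float, n_chunks: int):
    """Return a (offset, duration) pair for each chunk.

    Every chunk is chunk_size long except possibly the last, whose
    duration is derived from the total duration, as the parameter
    description states.
    """
    layout = []
    for i in range(n_chunks):
        offset = i * chunk_size
        duration = min(chunk_size, total_duration - offset)
        layout.append((offset, duration))
    return layout

# The example request: 90 s of audio split into three 30 s chunks.
print(chunk_layout(90, 30, 3))   # [(0, 30), (30, 30), (60, 30)]

# An uneven split: 75 s of audio in three chunks -> last chunk is 15 s.
print(chunk_layout(75, 30, 3))   # [(0, 30), (30, 30), (60, 15)]
```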
Output Schema
- segments: List of transcribed segments, each with `start`, `end`, `text`, and `chunk_index`
- total_chunks: Number of chunks processed
- processing_time: Transcription processing time in seconds
- detected_language: ISO code of the detected language
Example Execution Logs
Processing 3 audio URLs
Total duration: 90.00 seconds
Expected chunk duration: 30.00 seconds
Chunk size parameter: 30.00 seconds
Downloading chunk 1: https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_1.flac
Downloaded chunk 1 to /tmp/tmpr8gh3eyc.flac
Downloading chunk 2: https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_2.flac
Processing chunk 1 with offset 0.00s
Downloaded chunk 2 to /tmp/tmp298m71iy.flac
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu121. Bad things might happen unless you revert torch to 1.x.
/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/pyannote/audio/utils/reproducibility.py:74: ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled as it might lead to reproducibility issues and lower accuracy. It can be re-enabled by calling
>>> import torch
>>> torch.backends.cuda.matmul.allow_tf32 = True
>>> torch.backends.cudnn.allow_tf32 = True
See https://github.com/pyannote/pyannote-audio/issues/1370 for more details.
warnings.warn(
Completed chunk 1
Downloading chunk 3: https://storage.googleapis.com/meowtxt-bucket/test/chunks/fca16cb3-dbb3-4d2c-8f45-84a75354d125/fca16cb3-dbb3-4d2c-8f45-84a75354d125_chunk_3.flac
Processing chunk 2 with offset 601.09s
Downloaded chunk 3 to /tmp/tmp_2psrj2v.flac
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu121. Bad things might happen unless you revert torch to 1.x.
Completed chunk 2
Processing chunk 3 with offset 1202.18s
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu121. Bad things might happen unless you revert torch to 1.x.
Completed chunk 3
Merged 5 segments from 3 chunks
Total transcription duration: 1232.15 seconds
First segment: 0.03s - 23.86s
Last segment: 1202.21s - 1232.15s
Total processing time: 18.17 seconds
Version Details
- Version ID: d1ad611913cc51721dfeb9499c71d5866db3e6263619a6477782df67cde040d5
- Version Created: July 14, 2025