zsxkib/whisper-lazyloading
About
Convert speech in audio to text with the `tiny`, `small`, `base`, and `large-v3` models

Example Output
{"segments":[{"id":0,"end":18.6,"seek":0,"text":" the little tales they tell are false the door was barred locked and bolted as well ripe pears are fit for a queen's table a big wet stain was on the round carpet","start":0,"tokens":[50365,264,707,27254,436,980,366,7908,264,2853,390,2159,986,9376,293,13436,292,382,731,31421,520,685,366,3318,337,257,12206,311,3199,257,955,6630,16441,390,322,264,3098,18119,51295],"avg_logprob":-0.060147291276513075,"temperature":0,"no_speech_prob":0.05821266025304794,"compression_ratio":1.412280701754386},{"id":1,"end":31.840000000000003,"seek":1860,"text":" the kite dipped and swayed but stayed aloft the pleasant hours fly by much too soon the room was crowded with a mild wab","start":18.6,"tokens":[50365,264,38867,45162,293,27555,292,457,9181,419,6750,264,16232,2496,3603,538,709,886,2321,264,1808,390,21634,365,257,15154,261,455,51027],"avg_logprob":-0.11862952368600028,"temperature":0,"no_speech_prob":0.00025310463388450444,"compression_ratio":1.696969696969697},{"id":2,"end":45.2,"seek":1860,"text":" the room was crowded with a wild mob this strong arm shall shield your honour she blushed when he gave her a white orchid","start":31.840000000000003,"tokens":[51027,264,1808,390,21634,365,257,4868,4298,341,2068,3726,4393,10257,428,20631,750,25218,292,562,415,2729,720,257,2418,34850,327,51695],"avg_logprob":-0.11862952368600028,"temperature":0,"no_speech_prob":0.00025310463388450444,"compression_ratio":1.696969696969697},{"id":3,"end":48.6,"seek":1860,"text":" the beetle droned in the hot june sun","start":45.2,"tokens":[51695,264,49735,1224,19009,294,264,2368,361,2613,3295,51865],"avg_logprob":-0.11862952368600028,"temperature":0,"no_speech_prob":0.00025310463388450444,"compression_ratio":1.696969696969697},{"id":4,"end":52.38,"seek":4860,"text":" the beetle droned in the hot june sun","start":48.6,"tokens":[50365,264,49735,1224,19009,294,264,2368,361,2613,3295,50554],"avg_logprob":-0.3010915426107553,"temperature":0.4,"no_speech_prob":0.2937493324279785,"compression_ratio":0.8409090909090909}],"translation":null,"transcription":" the little tales they tell are false the door was barred locked and bolted as well ripe pears are fit for a queen's table a big wet stain was on the round carpet the kite dipped and swayed but stayed aloft the pleasant hours fly by much too soon the room was crowded with a mild wab the room was crowded with a wild mob this strong arm shall shield your honour she blushed when he gave her a white orchid the beetle droned in the hot june sun the beetle droned in the hot june sun","detected_language":"english"}
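Each entry in `segments` carries `start` and `end` times in seconds alongside its `text`, so the output can be turned into timed captions. The following is a minimal sketch, assuming the output has already been loaded into a Python dict with the shape shown above. The helper names `format_timestamp` and `segments_to_srt` are hypothetical, and the sample data is abridged from the example (tokens and scores omitted).

```python
from typing import Any


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(output: dict[str, Any]) -> str:
    """Build SRT caption text from the `segments` list of a prediction output."""
    blocks = []
    for index, seg in enumerate(output["segments"], start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Segment text starts with a space in the example output, so strip it.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Segments 3 and 4 from the example output, abridged to the fields used here.
example_output = {
    "segments": [
        {"id": 3, "start": 45.2, "end": 48.6,
         "text": " the beetle droned in the hot june sun"},
        {"id": 4, "start": 48.6, "end": 52.38,
         "text": " the beetle droned in the hot june sun"},
    ]
}

print(segments_to_srt(example_output))
```

The same loop works for any per-segment field, for instance filtering on `no_speech_prob` or `avg_logprob` before writing captions.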
Performance Metrics
- Prediction Time: 9.78s
- Total Time: 203.20s
All Input Parameters
{ "audio": "https://replicate.delivery/mgxm/e5159b1b-508a-4be4-b892-e1eb47850bdc/OSR_uk_000_0050_8k.wav", "model": "large-v3", "language": "auto", "translate": false, "temperature": 0, "transcription": "plain text", "suppress_tokens": "-1", "logprob_threshold": -1, "no_speech_threshold": 0.6, "condition_on_previous_text": true, "compression_ratio_threshold": 2.4, "temperature_increment_on_fallback": 0.2 }
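For reference, a request with these inputs might be issued from Python roughly as follows. This is a sketch under assumptions: it uses the `replicate` Python client (`pip install replicate`) with a `REPLICATE_API_TOKEN` set in the environment, pins the version ID listed under Version Details, and assumes the returned value has the shape shown in Example Output. The helper names `build_input` and `transcribe` are hypothetical.

```python
from typing import Any

# Model reference: owner/name plus the version ID from Version Details.
MODEL_REF = (
    "zsxkib/whisper-lazyloading:"
    "909df2f50ba92488979e2c3dea577937b7e991bd815395d3dfbe3bcbf5038276"
)

# Values copied from the "All Input Parameters" example (audio is passed separately).
DEFAULT_INPUT: dict[str, Any] = {
    "model": "large-v3",
    "language": "auto",
    "translate": False,
    "temperature": 0,
    "transcription": "plain text",
    "suppress_tokens": "-1",
    "logprob_threshold": -1,
    "no_speech_threshold": 0.6,
    "condition_on_previous_text": True,
    "compression_ratio_threshold": 2.4,
    "temperature_increment_on_fallback": 0.2,
}

# Parameters listed on this page that have no value in the example request.
OPTIONAL_PARAMS = {"patience", "initial_prompt"}


def build_input(audio_url: str, **overrides: Any) -> dict[str, Any]:
    """Merge the example defaults with caller overrides, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_INPUT) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"unknown input parameters: {sorted(unknown)}")
    return {"audio": audio_url, **DEFAULT_INPUT, **overrides}


def transcribe(audio_url: str, **overrides: Any) -> Any:
    """Run one prediction; requires network access and REPLICATE_API_TOKEN."""
    import replicate  # third-party client, imported lazily so build_input works without it

    return replicate.run(MODEL_REF, input=build_input(audio_url, **overrides))
```

Calling `transcribe("https://replicate.delivery/mgxm/e5159b1b-508a-4be4-b892-e1eb47850bdc/OSR_uk_000_0050_8k.wav")` would then repeat the example request; passing `translate=True` overrides a single field.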
Input Parameters
- audio (required)
- Audio file
- model
- Whisper model size (currently only large-v3 is supported).
- language
- Language spoken in the audio, specify 'auto' for automatic language detection
- patience
- optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424; the default (1.0) is equivalent to conventional beam search
- translate
- Translate the text to English when set to True
- temperature
- temperature to use for sampling
- transcription
- Choose the format for the transcription
- initial_prompt
- optional text to provide as a prompt for the first window.
- suppress_tokens
- comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuation
- logprob_threshold
- if the average log probability is lower than this value, treat the decoding as failed
- no_speech_threshold
- if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence
- condition_on_previous_text
- if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop
- compression_ratio_threshold
- if the gzip compression ratio is higher than this value, treat the decoding as failed
- temperature_increment_on_fallback
- amount by which to increase the temperature when falling back after decoding fails to meet either the `compression_ratio_threshold` or the `logprob_threshold`
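The threshold and fallback parameters interact, which a short sketch may make easier to follow. The code below is one reading of the descriptions above, modeled loosely on the open-source Whisper reference implementation; the code actually deployed in this model may differ. `DecodeResult`, `temperature_schedule`, `needs_fallback`, and `is_silence` are hypothetical names, and the upper temperature limit of 1.0 is an assumption taken from the reference implementation.

```python
from dataclasses import dataclass


@dataclass
class DecodeResult:
    """Per-segment scores, named after the fields in the example output."""
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float


def temperature_schedule(temperature: float = 0.0,
                         increment: float = 0.2) -> list[float]:
    """Temperatures to try in order: start value, then +increment up to 1.0 (assumed cap)."""
    if increment is None or increment <= 0:
        return [float(temperature)]
    steps = []
    i = 0
    while True:
        t = round(temperature + i * increment, 10)  # avoid float drift such as 0.6000000000000001
        if t > 1.0 + 1e-6:
            break
        steps.append(t)
        i += 1
    return steps or [float(temperature)]


def needs_fallback(result: DecodeResult,
                   compression_ratio_threshold: float = 2.4,
                   logprob_threshold: float = -1.0) -> bool:
    """A decoding counts as failed if it is too repetitive or too unlikely."""
    too_repetitive = result.compression_ratio > compression_ratio_threshold
    too_unlikely = result.avg_logprob < logprob_threshold
    return too_repetitive or too_unlikely


def is_silence(result: DecodeResult,
               logprob_threshold: float = -1.0,
               no_speech_threshold: float = 0.6) -> bool:
    """Silence requires both a high no-speech probability and a failed logprob check."""
    return (result.no_speech_prob > no_speech_threshold
            and result.avg_logprob < logprob_threshold)


# Scores of segment 0 from the example output: passes both thresholds at temperature 0.
first = DecodeResult(avg_logprob=-0.060147291276513075,
                     compression_ratio=1.412280701754386,
                     no_speech_prob=0.05821266025304794)
print(temperature_schedule(0, 0.2))   # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
print(needs_fallback(first))          # False
print(is_silence(first))              # False
```

With the example settings the decoder would try temperature 0 first and move to 0.2, 0.4, and so on only while `needs_fallback` holds. The last segment of the example output reports `"temperature":0.4`, which would be consistent with two fallbacks, though the page does not confirm that.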
Output Schema
Example Execution Logs
Transcribe with large-v3 model.
Detected language: English
0%| | 0/5241 [00:00<?, ?frames/s]
35%|███▌ | 1860/5241 [00:02<00:04, 706.14frames/s]
93%|█████████▎| 4860/5241 [00:06<00:00, 755.37frames/s]
100%|██████████| 5241/5241 [00:08<00:00, 554.70frames/s]
100%|██████████| 5241/5241 [00:08<00:00, 608.37frames/s]
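The progress bar counts frames rather than seconds. Assuming the usual Whisper convention of 100 frames per second (10 ms per frame), the frame counts in the log line up with the `seek` and `start` values in the example output. The constant and helper below are hypothetical names used only for illustration.

```python
# Assumption: one Whisper frame covers 10 ms of audio, i.e. 100 frames per second.
FRAMES_PER_SECOND = 100


def frames_to_seconds(frames: int) -> float:
    """Convert a frame count from the log (or a `seek` value) to seconds."""
    return frames / FRAMES_PER_SECOND


# Frame counts taken from the progress bar above.
for frames in (1860, 4860, 5241):
    print(f"{frames} frames -> {frames_to_seconds(frames)} s")
```

Under that assumption, 1860 and 4860 correspond to 18.6 s and 48.6 s, the `start` times of segments 1 and 4, and the total of 5241 frames corresponds to about 52.41 s of audio.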
Version Details
- Version ID: 909df2f50ba92488979e2c3dea577937b7e991bd815395d3dfbe3bcbf5038276
- Version Created: July 1, 2024