acappemin/deepaudio-v1 📝🖼️🔢 → 🖼️
About
DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
Example Output
Output
Performance Metrics
- Prediction Time: 24.14s
- Total Time: 24.15s
All Input Parameters
{
  "text": "Who finally decided to show up for work Yay",
  "video": "https://replicate.delivery/pbxt/MuPH7VmyWmOEmsGhJDawkwrJR4Ss1HwLdBJ4eXiLwkuPugOf/0235.mp4",
  "prompt": "",
  "text_prompt": "I've still got a few knocking around in here",
  "audio_prompt": "https://replicate.delivery/pbxt/MuPH7KLZCZhnSJ6etBmvdeeJmUjhOMqzb9TLJj4NN5vFZK0Y/Gobber-00-0778.wav",
  "v2a_num_steps": 25,
  "v2s_num_steps": 32
}
Input Parameters
- text: Video-to-Speech Transcription
- video: Input Video
- prompt: Video-to-Audio Text Prompt
- text_prompt: Video-to-Speech Speech Prompt Transcription
- audio_prompt: Video-to-Speech Speech Prompt
- v2a_num_steps: Video-to-Audio Num Steps
- v2s_num_steps: Video-to-Speech Num Steps
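The parameters above can be assembled into a request payload. The following is a minimal sketch, not the model author's code: `build_input` and `run_prediction` are hypothetical helper names, the step-count defaults are simply the values from the example input (the model's real defaults are not stated on this page), and the positivity check is an assumption. The `replicate.run` call assumes the official Replicate Python client is installed and `REPLICATE_API_TOKEN` is set.

```python
import json

# Model name and Version ID as shown on this page.
MODEL = "acappemin/deepaudio-v1"
VERSION = "354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45"


def build_input(video, text, audio_prompt, text_prompt, prompt="",
                v2a_num_steps=25, v2s_num_steps=32):
    """Assemble the input dict using the parameter names listed above.

    The defaults mirror the example input; they are not confirmed model defaults.
    """
    # Assumed sanity check: a step count below 1 is treated as invalid here.
    if v2a_num_steps < 1 or v2s_num_steps < 1:
        raise ValueError("step counts must be positive integers")
    return {
        "text": text,                    # transcription to be spoken
        "video": video,                  # input video URL
        "prompt": prompt,                # video-to-audio text prompt
        "text_prompt": text_prompt,      # transcription of the speech prompt
        "audio_prompt": audio_prompt,    # reference speech audio URL
        "v2a_num_steps": v2a_num_steps,
        "v2s_num_steps": v2s_num_steps,
    }


def run_prediction(payload):
    """Submit the payload with the Replicate client (requires network access)."""
    import replicate  # imported lazily so build_input works without the client
    return replicate.run(f"{MODEL}:{VERSION}", input=payload)


payload = build_input(
    video="https://replicate.delivery/pbxt/MuPH7VmyWmOEmsGhJDawkwrJR4Ss1HwLdBJ4eXiLwkuPugOf/0235.mp4",
    text="Who finally decided to show up for work Yay",
    audio_prompt="https://replicate.delivery/pbxt/MuPH7KLZCZhnSJ6etBmvdeeJmUjhOMqzb9TLJj4NN5vFZK0Y/Gobber-00-0778.wav",
    text_prompt="I've still got a few knocking around in here",
)
print(json.dumps(payload, indent=2))
```

Calling `run_prediction(payload)` would then submit the same request as the example above; pinning the Version ID keeps results tied to the version documented here.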
Output Schema
Output
Example Execution Logs
paths /tmp/tmp1oqg4zow0235.mp4 /tmp/tmp6xokgjqn.mp4/tmp /tmp/__tmp__tmp6xokgjqn.mp4.mp4
paths /tmp/tmp79vmi389Gobber-00-0778.wav /tmp/tmpzu5awjru.wav
2025-04-27 08:16:08.917 start
[INFO]: Using video /tmp/tmp6xokgjqn.mp4
[WARNING]: Clip video is too short: 3.25 < 8.00
[WARNING]: Truncating to 3.25 sec
[WARNING]: Sync video is too short: 3.20 < 3.25
[WARNING]: Truncating to 3.20 sec
[INFO]: Prompt:
[INFO]: Negative prompt:
[INFO]: Audio saved to /tmp/__tmp__tmp6xokgjqn.mp4.flac
[INFO]: Video saved to /tmp/__tmp__tmp6xokgjqn.mp4.mp4
[INFO]: Memory usage: 4.87 GB
2025-04-27 08:16:29.705 end
datas2 1 /tmp/__tmp__tmp6xokgjqn.mp4.mp4 None /tmp/__tmp__tmp6xokgjqn.mp4.flac None /tmp/tmpzu5awjru.wav
############energy shape torch.Size([1, 252, 1]) torch.Size([1, 300, 1]) <class 'torch.Tensor'> <class 'torch.Tensor'> torch.float32 torch.float32
Voice: main
ref_audio /tmp/tmpzu5awjru.wav
Converting audio...
Using custom reference text...
ref_text I've still got a few knocking around in here.
ref_audio_ /tmp/tmpul11dhd9.wav
No voice tag found, using main.
Voice: main
gen_text 0 Who finally decided to show up for work Yay
Generating audio in 1 batches...
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 1.35it/s]
100%|██████████| 1/1 [00:00<00:00, 1.35it/s]
Moviepy - Building video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4.
MoviePy - Writing audio in __tmp__tmp6xokgjqn.mp4.mp4.genTEMP_MPY_wvf_snd.mp4
chunk: 0%| | 0/71 [00:00<?, ?it/s, now=None]
MoviePy - Done.
Moviepy - Writing video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
t: 0%| | 0/78 [00:00<?, ?it/s, now=None]
t: 14%|█▍ | 11/78 [00:00<00:00, 105.93it/s, now=None]
t: 33%|███▎ | 26/78 [00:00<00:00, 125.79it/s, now=None]
t: 51%|█████▏ | 40/78 [00:00<00:00, 130.95it/s, now=None]
t: 71%|███████ | 55/78 [00:00<00:00, 134.89it/s, now=None]
t: 88%|████████▊ | 69/78 [00:00<00:00, 135.55it/s, now=None]
[WARNING]: /root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/moviepy/video/io/ffmpeg_reader.py:123: UserWarning: Warning: in file /tmp/__tmp__tmp6xokgjqn.mp4.mp4, 6220800 bytes wanted but 0 bytes read,at frame 77/78, at time 3.21/3.23 sec. Using the last valid frame instead.
warnings.warn("Warning: in file %s, "%(self.filename)+
Moviepy - Done !
Moviepy - video ready /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
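Logs captured from a run like this one may carry terminal color codes and report when an input video is shorter than a stage expects. The sketch below shows one way to clean such a log and pull out the "too short" warnings. The helper names `strip_ansi` and `short_video_warnings` are hypothetical, and reading the two numbers as (actual seconds, expected seconds) is an assumption based only on the wording of the messages.

```python
import re

# Matches color codes such as "[33m" or "[0m"; the leading ESC byte is optional
# because it is often lost when logs are copied from a web page.
ANSI_RE = re.compile(r"\x1b?\[\d+(?:;\d+)*m")

# Matches messages of the form "<Stage> video is too short: <a> < <b>".
SHORT_RE = re.compile(r"(\w+) video is too short: ([\d.]+) < ([\d.]+)")


def strip_ansi(text):
    """Remove terminal color codes from a log string."""
    return ANSI_RE.sub("", text)


def short_video_warnings(log):
    """Return (stage, actual_sec, expected_sec) for each 'too short' warning."""
    clean = strip_ansi(log)
    return [(stage, float(actual), float(expected))
            for stage, actual, expected in SHORT_RE.findall(clean)]


sample = (
    "[[33mWARNING [0m]: [33mClip video is too short: 3.25 < 8.00[0m "
    "[[33mWARNING [0m]: [33mTruncating to 3.25 sec[0m "
    "[[33mWARNING [0m]: [33mSync video is too short: 3.20 < 3.25[0m"
)

warnings_found = short_video_warnings(sample)
for stage, actual, expected in warnings_found:
    print(f"{stage}: got {actual:.2f}s, expected at least {expected:.2f}s")
```

For this run, both warnings trace back to the 3.25-second input clip, which the log shows being truncated rather than rejected.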
Version Details
- Version ID: 354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45
- Version Created: April 27, 2025