acappemin/deepaudio-v1 📝🖼️🔢 → 🖼️
About
DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
Example Output
Output
Performance Metrics
- Prediction Time: 24.14s
- Total Time: 24.15s
All Input Parameters
{
"text": "Who finally decided to show up for work Yay",
"video": "https://replicate.delivery/pbxt/MuPH7VmyWmOEmsGhJDawkwrJR4Ss1HwLdBJ4eXiLwkuPugOf/0235.mp4",
"prompt": "",
"text_prompt": "I've still got a few knocking around in here",
"audio_prompt": "https://replicate.delivery/pbxt/MuPH7KLZCZhnSJ6etBmvdeeJmUjhOMqzb9TLJj4NN5vFZK0Y/Gobber-00-0778.wav",
"v2a_num_steps": 25,
"v2s_num_steps": 32
}
Input Parameters
- text: Video-to-Speech Transcription
- video: Input Video
- prompt: Video-to-Audio Text Prompt
- text_prompt: Video-to-Speech Speech Prompt Transcription
- audio_prompt: Video-to-Speech Speech Prompt
- v2a_num_steps: Video-to-Audio Num Steps
- v2s_num_steps: Video-to-Speech Num Steps
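The parameters listed above can be assembled into a request payload as in the following minimal sketch. It assumes the third-party `replicate` Python client is installed and that a `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `run_prediction` are hypothetical, and the step counts used as defaults simply mirror the example input on this page; they are not confirmed model defaults.

```python
from typing import Any, Dict

# Model reference built from the model name and Version ID shown on this page.
MODEL_REF = (
    "acappemin/deepaudio-v1:"
    "354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45"
)


def build_input(
    video: str,
    text: str,
    text_prompt: str,
    audio_prompt: str,
    prompt: str = "",
    v2a_num_steps: int = 25,  # assumption: copied from the example input
    v2s_num_steps: int = 32,  # assumption: copied from the example input
) -> Dict[str, Any]:
    """Assemble the input dictionary using the parameter names from this page."""
    if v2a_num_steps <= 0 or v2s_num_steps <= 0:
        raise ValueError("step counts must be positive integers")
    return {
        "text": text,                  # transcription of the speech to generate
        "video": video,                # input video URL
        "prompt": prompt,              # video-to-audio text prompt (may be empty)
        "text_prompt": text_prompt,    # transcription of the reference speech
        "audio_prompt": audio_prompt,  # reference speech audio URL
        "v2a_num_steps": v2a_num_steps,
        "v2s_num_steps": v2s_num_steps,
    }


def run_prediction(inputs: Dict[str, Any]) -> Any:
    """Submit the payload; needs network access and an API token."""
    import replicate  # imported lazily so build_input works without the client

    return replicate.run(MODEL_REF, input=inputs)
```

Calling `run_prediction(build_input(...))` with the URLs from the example input should reproduce a request like the one shown on this page. The shape of the returned output is not documented here, so inspect it before relying on it.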
Output Schema
Output
Example Execution Logs
paths /tmp/tmp1oqg4zow0235.mp4 /tmp/tmp6xokgjqn.mp4/tmp /tmp/__tmp__tmp6xokgjqn.mp4.mp4
paths /tmp/tmp79vmi389Gobber-00-0778.wav /tmp/tmpzu5awjru.wav
2025-04-27 08:16:08.917 start
[INFO]: Using video /tmp/tmp6xokgjqn.mp4
[WARNING]: Clip video is too short: 3.25 < 8.00
[WARNING]: Truncating to 3.25 sec
[WARNING]: Sync video is too short: 3.20 < 3.25
[WARNING]: Truncating to 3.20 sec
[INFO]: Prompt:
[INFO]: Negative prompt:
[INFO]: Audio saved to /tmp/__tmp__tmp6xokgjqn.mp4.flac
[INFO]: Video saved to /tmp/__tmp__tmp6xokgjqn.mp4.mp4
[INFO]: Memory usage: 4.87 GB
2025-04-27 08:16:29.705 end
datas2 1
/tmp/__tmp__tmp6xokgjqn.mp4.mp4 None /tmp/__tmp__tmp6xokgjqn.mp4.flac None /tmp/tmpzu5awjru.wav
############energy shape torch.Size([1, 252, 1]) torch.Size([1, 300, 1]) <class 'torch.Tensor'> <class 'torch.Tensor'> torch.float32 torch.float32
Voice: main
ref_audio /tmp/tmpzu5awjru.wav
Converting audio...
Using custom reference text...
ref_text I've still got a few knocking around in here.
ref_audio_ /tmp/tmpul11dhd9.wav
No voice tag found, using main.
Voice: main
gen_text 0 Who finally decided to show up for work Yay
Generating audio in 1 batches...
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 1.35it/s]
Moviepy - Building video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4.
MoviePy - Writing audio in __tmp__tmp6xokgjqn.mp4.mp4.genTEMP_MPY_wvf_snd.mp4
chunk: 0%| | 0/71 [00:00<?, ?it/s, now=None]
MoviePy - Done.
Moviepy - Writing video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
t: 0%| | 0/78 [00:00<?, ?it/s, now=None]
t: 14%|█▍ | 11/78 [00:00<00:00, 105.93it/s, now=None]
t: 33%|███▎ | 26/78 [00:00<00:00, 125.79it/s, now=None]
t: 51%|█████▏ | 40/78 [00:00<00:00, 130.95it/s, now=None]
t: 71%|███████ | 55/78 [00:00<00:00, 134.89it/s, now=None]
t: 88%|████████▊ | 69/78 [00:00<00:00, 135.55it/s, now=None]
[WARNING]: /root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/moviepy/video/io/ffmpeg_reader.py:123: UserWarning: Warning: in file /tmp/__tmp__tmp6xokgjqn.mp4.mp4, 6220800 bytes wanted but 0 bytes read,at frame 77/78, at time 3.21/3.23 sec. Using the last valid frame instead.
warnings.warn("Warning: in file %s, "%(self.filename)+
Moviepy - Done !
Moviepy - video ready /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
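The logs above contain two timestamped marker lines ending in `start` and `end`. As a rough sketch, the elapsed time between them can be computed as below. The function name `marker_elapsed_seconds` is hypothetical, and which pipeline stage the markers bracket is an assumption; the page does not say.

```python
import re
from datetime import datetime

# Matches lines such as "2025-04-27 08:16:08.917 start".
MARKER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (start|end)$")


def marker_elapsed_seconds(log_text: str) -> float:
    """Return the seconds between the 'start' and 'end' marker lines."""
    marks = {}
    for line in log_text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            marks[match.group(2)] = stamp
    if "start" not in marks or "end" not in marks:
        raise ValueError("log does not contain both start and end markers")
    return (marks["end"] - marks["start"]).total_seconds()


sample = """2025-04-27 08:16:08.917 start
some other log line
2025-04-27 08:16:29.705 end"""
elapsed = marker_elapsed_seconds(sample)
```

For the two markers in this run, the difference is about 20.79 seconds of the 24.14 s prediction time reported under Performance Metrics.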
Version Details
- Version ID: 354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45
- Version Created: April 27, 2025