acappemin/deepaudio-v1 📝🖼️🔢 → 🖼️

▶️ 62 runs 📅 Apr 2025 ⚙️ Cog 0.14.7 🔗 GitHub 📄 Paper
audio-generation video-to-audio video-to-speech voice-cloning

About

DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

Example Output

Output

Performance Metrics

24.14s Prediction Time
24.15s Total Time
All Input Parameters
{
  "text": "Who finally decided to show up for work Yay",
  "video": "https://replicate.delivery/pbxt/MuPH7VmyWmOEmsGhJDawkwrJR4Ss1HwLdBJ4eXiLwkuPugOf/0235.mp4",
  "prompt": "",
  "text_prompt": "I've still got a few knocking around in here",
  "audio_prompt": "https://replicate.delivery/pbxt/MuPH7KLZCZhnSJ6etBmvdeeJmUjhOMqzb9TLJj4NN5vFZK0Y/Gobber-00-0778.wav",
  "v2a_num_steps": 25,
  "v2s_num_steps": 32
}
Input Parameters
text Type: string Default: ""
Video-to-Speech Transcription
video Type: string
Input Video
prompt Type: string Default: ""
Video-to-Audio Text Prompt
text_prompt Type: string Default: ""
Video-to-Speech Speech Prompt Transcription
audio_prompt Type: string
Video-to-Speech Speech Prompt
v2a_num_steps Type: integer Default: 25
Video-to-Audio Num Steps
v2s_num_steps Type: integer Default: 32
Video-to-Speech Num Steps
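The parameters above map directly onto a prediction request. The following is a minimal sketch of how such a request might be assembled and sent with the `replicate` Python client. It is not taken from the model's repository: `build_input` and `run_prediction` are hypothetical helper names, the positive-integer check on the step counts is an assumption (the page only states the type), and treating `video` and `audio_prompt` as required is inferred from the absence of a listed default.

```python
from typing import Any, Dict

# Model reference built from the page title and the Version ID listed on this page.
MODEL_REF = (
    "acappemin/deepaudio-v1:"
    "354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45"
)


def build_input(
    video: str,
    audio_prompt: str,
    text: str = "",
    text_prompt: str = "",
    prompt: str = "",
    v2a_num_steps: int = 25,
    v2s_num_steps: int = 32,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the input payload with the documented defaults."""
    for name, value in (
        ("v2a_num_steps", v2a_num_steps),
        ("v2s_num_steps", v2s_num_steps),
    ):
        # Assumption: step counts must be positive integers.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return {
        "text": text,                  # Video-to-Speech transcription
        "video": video,                # input video URL
        "prompt": prompt,              # Video-to-Audio text prompt
        "text_prompt": text_prompt,    # transcription of the speech prompt
        "audio_prompt": audio_prompt,  # speech prompt (reference voice) URL
        "v2a_num_steps": v2a_num_steps,
        "v2s_num_steps": v2s_num_steps,
    }


def run_prediction(model_input: Dict[str, Any]) -> Any:
    """Send the payload to Replicate.

    Requires `pip install replicate` and the REPLICATE_API_TOKEN environment
    variable. The return type depends on the client version (a URL string or
    a file-like output object).
    """
    import replicate  # imported lazily so build_input works without the client

    return replicate.run(MODEL_REF, input=model_input)


if __name__ == "__main__":
    payload = build_input(
        video="https://replicate.delivery/pbxt/MuPH7VmyWmOEmsGhJDawkwrJR4Ss1HwLdBJ4eXiLwkuPugOf/0235.mp4",
        audio_prompt="https://replicate.delivery/pbxt/MuPH7KLZCZhnSJ6etBmvdeeJmUjhOMqzb9TLJj4NN5vFZK0Y/Gobber-00-0778.wav",
        text="Who finally decided to show up for work Yay",
        text_prompt="I've still got a few knocking around in here",
    )
    print(payload)
```

Calling `run_prediction(payload)` would then submit the request. Keeping the payload construction separate from the network call makes it possible to inspect or validate the inputs without an API token.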
Output Schema

Output

Type: stringFormat: uri

Example Execution Logs
paths /tmp/tmp1oqg4zow0235.mp4 /tmp/tmp6xokgjqn.mp4/tmp /tmp/__tmp__tmp6xokgjqn.mp4.mp4
paths /tmp/tmp79vmi389Gobber-00-0778.wav /tmp/tmpzu5awjru.wav
2025-04-27 08:16:08.917 start
[INFO    ]: Using video /tmp/tmp6xokgjqn.mp4
[WARNING ]: Clip video is too short: 3.25 < 8.00
[WARNING ]: Truncating to 3.25 sec
[WARNING ]: Sync video is too short: 3.20 < 3.25
[WARNING ]: Truncating to 3.20 sec
[INFO    ]: Prompt: 
[INFO    ]: Negative prompt: 
[INFO    ]: Audio saved to /tmp/__tmp__tmp6xokgjqn.mp4.flac
[INFO    ]: Video saved to /tmp/__tmp__tmp6xokgjqn.mp4.mp4
[INFO    ]: Memory usage: 4.87 GB
2025-04-27 08:16:29.705 end
datas2 1
/tmp/__tmp__tmp6xokgjqn.mp4.mp4 None /tmp/__tmp__tmp6xokgjqn.mp4.flac None /tmp/tmpzu5awjru.wav
############energy shape torch.Size([1, 252, 1]) torch.Size([1, 300, 1]) <class 'torch.Tensor'> <class 'torch.Tensor'> torch.float32 torch.float32
Voice: main
ref_audio  /tmp/tmpzu5awjru.wav
Converting audio...
Using custom reference text...
ref_text   I've still got a few knocking around in here.
ref_audio_ /tmp/tmpul11dhd9.wav
No voice tag found, using main.
Voice: main
gen_text 0 Who finally decided to show up for work Yay
Generating audio in 1 batches...
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  1.35it/s]
100%|██████████| 1/1 [00:00<00:00,  1.35it/s]
Moviepy - Building video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4.
MoviePy - Writing audio in __tmp__tmp6xokgjqn.mp4.mp4.genTEMP_MPY_wvf_snd.mp4
chunk:   0%|          | 0/71 [00:00<?, ?it/s, now=None]
MoviePy - Done.
Moviepy - Writing video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
t:   0%|          | 0/78 [00:00<?, ?it/s, now=None]
t:  14%|█▍        | 11/78 [00:00<00:00, 105.93it/s, now=None]
t:  33%|███▎      | 26/78 [00:00<00:00, 125.79it/s, now=None]
t:  51%|█████▏    | 40/78 [00:00<00:00, 130.95it/s, now=None]
t:  71%|███████   | 55/78 [00:00<00:00, 134.89it/s, now=None]
t:  88%|████████▊ | 69/78 [00:00<00:00, 135.55it/s, now=None][WARNING ]: /root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/moviepy/video/io/ffmpeg_reader.py:123: UserWarning: Warning: in file /tmp/__tmp__tmp6xokgjqn.mp4.mp4, 6220800 bytes wanted but 0 bytes read,at frame 77/78, at time 3.21/3.23 sec. Using the last valid frame instead.
warnings.warn("Warning: in file %s, "%(self.filename)+

                                                             
Moviepy - Done !
Moviepy - video ready /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
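The `start` and `end` lines in the logs above carry timestamps, so the time spent between them can be recovered and compared with the reported prediction time. The sketch below shows one way to do that. `stage_seconds` is a hypothetical helper, the log excerpt is abbreviated to the relevant lines, and reading the bracketed span as the video-to-audio stage is an assumption based on the messages that appear between the two markers.

```python
from datetime import datetime

# Abbreviated excerpt of the execution logs shown on this page.
LOG_EXCERPT = """\
2025-04-27 08:16:08.917 start
[INFO    ]: Using video /tmp/tmp6xokgjqn.mp4
[INFO    ]: Memory usage: 4.87 GB
2025-04-27 08:16:29.705 end
"""

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def stage_seconds(log_text: str) -> float:
    """Hypothetical helper: seconds elapsed between the 'start' and 'end' markers."""
    stamps = {}
    for line in log_text.splitlines():
        parts = line.rsplit(" ", 1)
        if len(parts) != 2 or parts[1] not in ("start", "end"):
            continue
        try:
            stamps[parts[1]] = datetime.strptime(parts[0], TIMESTAMP_FORMAT)
        except ValueError:
            # The line ends in 'start'/'end' but does not begin with a timestamp.
            continue
    return (stamps["end"] - stamps["start"]).total_seconds()


if __name__ == "__main__":
    elapsed = stage_seconds(LOG_EXCERPT)
    print(f"{elapsed:.3f} s between start and end markers")
```

For this run the markers are 20.788 s apart, which leaves roughly 3.35 s of the 24.14 s prediction time for everything outside them. Presumably that covers input handling, speech generation, and writing the final video, though the logs do not break this down.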
Version Details
Version ID
354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45
Version Created
April 27, 2025