acappemin/deepaudio-v1 📝🖼️🔢 → 🖼️

▶️ 72 runs 📅 Apr 2025 ⚙️ Cog 0.14.7 🔗 GitHub 📄 Paper

audio-generation text-to-speech video-to-audio video-to-speech voice-cloning

Performance

Typical run time: 24.1 s

Total runs: 72

About

DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

Example Output


Performance Metrics

Prediction time: 24.14 s

Total time: 24.15 s

All Input Parameters

{
  "text": "Who finally decided to show up for work Yay",
  "video": "https://replicate.delivery/pbxt/MuPH7VmyWmOEmsGhJDawkwrJR4Ss1HwLdBJ4eXiLwkuPugOf/0235.mp4",
  "prompt": "",
  "text_prompt": "I've still got a few knocking around in here",
  "audio_prompt": "https://replicate.delivery/pbxt/MuPH7KLZCZhnSJ6etBmvdeeJmUjhOMqzb9TLJj4NN5vFZK0Y/Gobber-00-0778.wav",
  "v2a_num_steps": 25,
  "v2s_num_steps": 32
}

Input Parameters

text (string, default: empty): Video-to-Speech Transcription
video (string): Input Video
prompt (string, default: empty): Video-to-Audio Text Prompt
text_prompt (string, default: empty): Video-to-Speech Speech Prompt Transcription
audio_prompt (string): Video-to-Speech Speech Prompt
v2a_num_steps (integer, default: 25): Video-to-Audio Num Steps
v2s_num_steps (integer, default: 32): Video-to-Speech Num Steps
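
The parameters above map directly onto the `input` dictionary of a prediction request. The following is a minimal sketch, assuming the official `replicate` Python client is installed and `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `run_prediction` are hypothetical and not part of the model or the client; the defaults mirror the ones listed above, and the version hash is the one shown under Version Details.

```python
from typing import Any, Dict, Optional

# Model reference: owner/name plus the version ID listed on this page.
MODEL_REF = (
    "acappemin/deepaudio-v1:"
    "354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45"
)


def build_input(
    video: str,
    text: str = "",
    prompt: str = "",
    text_prompt: str = "",
    audio_prompt: Optional[str] = None,
    v2a_num_steps: int = 25,
    v2s_num_steps: int = 32,
) -> Dict[str, Any]:
    """Assemble the input payload (hypothetical helper, for illustration only)."""
    if not video:
        # Assumption: the video is required, since no default is listed for it.
        raise ValueError("video must be a non-empty URL or path")
    payload: Dict[str, Any] = {
        "video": video,
        "text": text,                # transcription to be spoken
        "prompt": prompt,            # text prompt for the video-to-audio stage
        "text_prompt": text_prompt,  # transcription of the speech prompt audio
        "v2a_num_steps": v2a_num_steps,
        "v2s_num_steps": v2s_num_steps,
    }
    # Only send the speech prompt when one is provided.
    if audio_prompt is not None:
        payload["audio_prompt"] = audio_prompt
    return payload


def run_prediction(payload: Dict[str, Any]) -> Any:
    """Submit the payload; needs network access and an API token."""
    import replicate  # imported lazily so build_input works without the client

    return replicate.run(MODEL_REF, input=payload)
```

Calling `run_prediction(build_input(video=..., text=..., audio_prompt=..., text_prompt=...))` with the values from the example above should reproduce that request. Per the output schema, the result refers to a URI; depending on the client version it may arrive as a plain string or as a file-like wrapper.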

Output Schema

Output

Type: string • Format: uri

Example Execution Logs

paths /tmp/tmp1oqg4zow0235.mp4 /tmp/tmp6xokgjqn.mp4/tmp /tmp/__tmp__tmp6xokgjqn.mp4.mp4
paths /tmp/tmp79vmi389Gobber-00-0778.wav /tmp/tmpzu5awjru.wav
2025-04-27 08:16:08.917 start
[INFO]: Using video /tmp/tmp6xokgjqn.mp4
[WARNING]: Clip video is too short: 3.25 < 8.00
[WARNING]: Truncating to 3.25 sec
[WARNING]: Sync video is too short: 3.20 < 3.25
[WARNING]: Truncating to 3.20 sec
[INFO]: Prompt: 
[INFO]: Negative prompt: 
[INFO]: Audio saved to /tmp/__tmp__tmp6xokgjqn.mp4.flac
[INFO]: Video saved to /tmp/__tmp__tmp6xokgjqn.mp4.mp4
[INFO]: Memory usage: 4.87 GB
2025-04-27 08:16:29.705 end
datas2 1
/tmp/__tmp__tmp6xokgjqn.mp4.mp4 None /tmp/__tmp__tmp6xokgjqn.mp4.flac None /tmp/tmpzu5awjru.wav
############energy shape torch.Size([1, 252, 1]) torch.Size([1, 300, 1]) <class 'torch.Tensor'> <class 'torch.Tensor'> torch.float32 torch.float32
Voice: main
ref_audio  /tmp/tmpzu5awjru.wav
Converting audio...
Using custom reference text...
ref_text   I've still got a few knocking around in here.
ref_audio_ /tmp/tmpul11dhd9.wav
No voice tag found, using main.
Voice: main
gen_text 0 Who finally decided to show up for work Yay
Generating audio in 1 batches...
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  1.35it/s]
100%|██████████| 1/1 [00:00<00:00,  1.35it/s]
Moviepy - Building video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4.
MoviePy - Writing audio in __tmp__tmp6xokgjqn.mp4.mp4.genTEMP_MPY_wvf_snd.mp4
chunk:   0%|          | 0/71 [00:00<?, ?it/s, now=None]
MoviePy - Done.
Moviepy - Writing video /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
t:   0%|          | 0/78 [00:00<?, ?it/s, now=None]
t:  14%|█▍        | 11/78 [00:00<00:00, 105.93it/s, now=None]
t:  33%|███▎      | 26/78 [00:00<00:00, 125.79it/s, now=None]
t:  51%|█████▏    | 40/78 [00:00<00:00, 130.95it/s, now=None]
t:  71%|███████   | 55/78 [00:00<00:00, 134.89it/s, now=None]
t:  88%|████████▊ | 69/78 [00:00<00:00, 135.55it/s, now=None]
[WARNING]: /root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/moviepy/video/io/ffmpeg_reader.py:123: UserWarning: Warning: in file /tmp/__tmp__tmp6xokgjqn.mp4.mp4, 6220800 bytes wanted but 0 bytes read,at frame 77/78, at time 3.21/3.23 sec. Using the last valid frame instead.
warnings.warn("Warning: in file %s, "%(self.filename)+
Moviepy - Done !
Moviepy - video ready /tmp/__tmp__tmp6xokgjqn.mp4.mp4.gen.mp4
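
The `start` and `end` markers in the log above carry timestamps, so the time spent between them can be computed directly. The sketch below assumes the marker format seen in this log (a timestamp followed by a single-word label); `parse_marker` and `elapsed_seconds` are hypothetical helper names. Which stage the markers bracket is not stated in the log; the lines between them suggest the video-to-audio step, but that is an inference.

```python
from datetime import datetime

# Timestamp layout used by the "start" / "end" marker lines in the log.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_marker(line: str) -> datetime:
    """Parse a line such as '2025-04-27 08:16:08.917 start'."""
    # Everything before the last space is the timestamp; the rest is the label.
    stamp, _label = line.rsplit(" ", 1)
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Seconds between two marker lines."""
    return (parse_marker(end_line) - parse_marker(start_line)).total_seconds()


stage = elapsed_seconds(
    "2025-04-27 08:16:08.917 start",
    "2025-04-27 08:16:29.705 end",
)
print(f"{stage:.3f} s")  # prints "20.788 s"
```

For this run the bracketed span is about 20.8 s of the 24.14 s prediction time, which leaves roughly 3.4 s for everything outside the markers.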

Version Details

Version ID: 354a16e5caccc8bcc33d084b6604f544006e315721f469737a3f3005327b7f45
Version Created: April 27, 2025
