lagune870601/sonic_cc 🔢🖼️✓ → 🖼️

▶️ 142 runs 📅 Jun 2025 ⚙️ Cog 0.15.3
lipsync

About

sonic_cc

Example Output

Performance Metrics

396.41s Prediction Time
456.85s Total Time
All Input Parameters
{
  "audio": "https://replicate.delivery/pbxt/Nm49saZpPThIv9WRi6Z67k0oXBNoyS3Yw3V3pAufR2EqC2QF/1_song_audio.mp3",
  "image": "https://replicate.delivery/pbxt/Nm49sEWxE9ttqF3uVRWc5HUDHxMBqGiknEBCKJGh56MadrKz/c3e9ece9-eccf-4887-a45b-0e4d6ac7015d.jpeg",
  "crop_image": false,
  "dynamic_scale": 1,
  "min_resolution": 512,
  "inference_steps": 25,
  "keep_resolution": false
}
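A minimal sketch of how the parameters above could be packaged into a prediction request. It assumes the standard Replicate HTTP predictions endpoint and uses the version ID listed under Version Details; the helper name build_prediction_request, the placeholder file URLs, and the token string are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Version ID as listed under "Version Details" on this page.
VERSION_ID = "a8b068b27f183677c4dbd0f58574cbc07bb7f496cd1b3750bcd37586f1421abd"

# Assumed endpoint: the general-purpose Replicate "create prediction" URL.
PREDICTIONS_URL = "https://api.replicate.com/v1/predictions"


def build_prediction_request(inputs: dict, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create a prediction."""
    body = json.dumps({"version": VERSION_ID, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        PREDICTIONS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same settings as the example above; the two URLs are placeholders.
example_inputs = {
    "audio": "https://example.com/voice.mp3",
    "image": "https://example.com/portrait.jpeg",
    "crop_image": False,
    "dynamic_scale": 1,
    "min_resolution": 512,
    "inference_steps": 25,
    "keep_resolution": False,
}

request = build_prediction_request(example_inputs, api_token="YOUR_TOKEN")
# Passing `request` to urllib.request.urlopen() would submit it.
```

Building the request separately from sending it keeps the payload easy to inspect before any billable run is started.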
Input Parameters
seed Type: integer
Random seed for reproducible results. Leave blank for a random seed.
audio (required) Type: string
Input audio file (WAV, MP3, etc.) for the voice.
image (required) Type: string
Input portrait image (will be cropped if face is detected).
crop_image Type: boolean, Default: false
If true, crop the image so that only the head region is kept.
dynamic_scale Type: number, Default: 1, Range: 0.5 - 2
Controls movement intensity. Increase/decrease for more/less movement.
min_resolution Type: integer, Default: 512, Range: 256 - 1024
Minimum image resolution for processing. Lower values use less memory but may reduce quality.
inference_steps Type: integer, Default: 25, Range: 5 - 50
Number of diffusion steps. Higher values may improve quality but take longer.
keep_resolution Type: boolean, Default: false
If true, the output video matches the original image resolution. Otherwise, it uses min_resolution after cropping.
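The defaults and ranges listed above can be checked on the client side before a run is submitted. The following is a sketch only: the function name prepare_inputs and the error messages are hypothetical, and the model may apply its own validation differently.

```python
# Defaults and ranges copied from the parameter list above.
DEFAULTS = {
    "crop_image": False,
    "dynamic_scale": 1.0,
    "min_resolution": 512,
    "inference_steps": 25,
    "keep_resolution": False,
}
RANGES = {
    "dynamic_scale": (0.5, 2.0),
    "min_resolution": (256, 1024),
    "inference_steps": (5, 50),
}
REQUIRED = ("audio", "image")
OPTIONAL_WITHOUT_DEFAULT = ("seed",)  # omitted means a random seed


def prepare_inputs(**overrides) -> dict:
    """Merge overrides into the documented defaults and check the ranges."""
    for name in REQUIRED:
        if not overrides.get(name):
            raise ValueError(f"missing required input: {name}")

    known = set(DEFAULTS) | set(REQUIRED) | set(OPTIONAL_WITHOUT_DEFAULT)
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")

    inputs = {**DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        if not low <= inputs[name] <= high:
            raise ValueError(f"{name}={inputs[name]} is outside {low} - {high}")
    return inputs


# Placeholder URLs; only inference_steps differs from the defaults.
inputs = prepare_inputs(
    audio="https://example.com/voice.mp3",
    image="https://example.com/portrait.jpeg",
    inference_steps=10,
)
```

Rejecting out-of-range values locally avoids waiting on a queued prediction that would fail input validation anyway.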
Output Schema

Output

Type: string, Format: uri

Example Execution Logs
Starting prediction...
Saved input image to: /src/tmp_path/input_image.png
Converted and saved audio to: /src/tmp_path/input_audio.wav
Preprocessing image...
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Face detection result: 1 face(s) found
Using original image for processing (no face detected)
Generating talking face animation...
  0%|          | 0/186 [00:00<?, ?it/s]
 87%|████████▋ | 161/186 [00:00<00:00, 1609.53it/s]
100%|██████████| 186/186 [00:00<00:00, 1630.35it/s]
  0%|          | 0/25 [00:00<?, ?it/s]
  4%|▍         | 1/25 [00:12<05:07, 12.81s/it]
  8%|▊         | 2/25 [00:26<05:05, 13.26s/it]
 12%|█▏        | 3/25 [00:40<04:55, 13.43s/it]
 16%|█▌        | 4/25 [00:53<04:43, 13.51s/it]
 20%|██        | 5/25 [01:07<04:31, 13.56s/it]
 24%|██▍       | 6/25 [01:20<04:18, 13.58s/it]
 28%|██▊       | 7/25 [01:34<04:04, 13.60s/it]
 32%|███▏      | 8/25 [01:48<03:51, 13.62s/it]
 36%|███▌      | 9/25 [02:01<03:38, 13.63s/it]
 40%|████      | 10/25 [02:15<03:24, 13.64s/it]
 44%|████▍     | 11/25 [02:29<03:11, 13.65s/it]
 48%|████▊     | 12/25 [02:42<02:57, 13.65s/it]
 52%|█████▏    | 13/25 [02:56<02:43, 13.65s/it]
 56%|█████▌    | 14/25 [03:10<02:30, 13.66s/it]
 60%|██████    | 15/25 [03:23<02:16, 13.66s/it]
 64%|██████▍   | 16/25 [03:37<02:02, 13.65s/it]
 68%|██████▊   | 17/25 [03:51<01:49, 13.65s/it]
 72%|███████▏  | 18/25 [04:04<01:35, 13.66s/it]
 76%|███████▌  | 19/25 [04:18<01:21, 13.66s/it]
 80%|████████  | 20/25 [04:32<01:08, 13.65s/it]
 84%|████████▍ | 21/25 [04:45<00:54, 13.65s/it]
 88%|████████▊ | 22/25 [04:59<00:40, 13.65s/it]
 92%|█████████▏| 23/25 [05:13<00:27, 13.65s/it]
 96%|█████████▌| 24/25 [05:26<00:13, 13.65s/it]
100%|██████████| 25/25 [05:40<00:00, 13.64s/it]
100%|██████████| 25/25 [05:40<00:00, 13.61s/it]
  0% 0/185 [00:00<?, ?it/s]
  4% 7/185 [00:00<00:02, 63.60it/s]
  9% 16/185 [00:00<00:02, 77.96it/s]
 14% 25/185 [00:00<00:01, 82.84it/s]
 18% 34/185 [00:00<00:01, 85.22it/s]
 23% 43/185 [00:00<00:01, 86.55it/s]
 28% 52/185 [00:00<00:01, 87.36it/s]
 33% 61/185 [00:00<00:01, 87.91it/s]
 38% 70/185 [00:00<00:01, 88.23it/s]
 43% 79/185 [00:00<00:01, 88.45it/s]
 48% 88/185 [00:01<00:01, 88.58it/s]
 52% 97/185 [00:01<00:00, 88.71it/s]
 57% 106/185 [00:01<00:00, 88.78it/s]
 62% 115/185 [00:01<00:00, 88.81it/s]
 67% 124/185 [00:01<00:00, 88.84it/s]
 72% 133/185 [00:01<00:00, 88.87it/s]
 77% 142/185 [00:01<00:00, 88.89it/s]
 82% 151/185 [00:01<00:00, 88.92it/s]
 86% 160/185 [00:01<00:00, 88.92it/s]
 91% 169/185 [00:01<00:00, 88.91it/s]
 96% 178/185 [00:02<00:00, 88.88it/s]
100% 185/185 [00:02<00:00, 87.59it/s]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 70.100 / 56. 70.100
libavcodec     58.134.100 / 58.134.100
libavformat    58. 76.100 / 58. 76.100
libavdevice    58. 13.100 / 58. 13.100
libavfilter     7.110.100 /  7.110.100
libswscale      5.  9.100 /  5.  9.100
libswresample   3.  9.100 /  3.  9.100
libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/src/res_path/output_noaudio.mp4':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
encoder         : Lavf58.29.100
Duration: 00:00:14.84, start: 0.000000, bitrate: 1048 kb/s
Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 512x896, 1045 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
Guessed Channel Layout for Input Stream #1.0 : stereo
Input #1, wav, from '/src/tmp_path/input_audio.wav':
Duration: 00:00:14.96, bitrate: 1411 kb/s
Stream #1:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))
Stream #1:0 -> #0:1 (pcm_s16le (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 0x5e6466012380] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x5e6466012380] profile High, level 3.1, 4:2:0, 8-bit
[libx264 @ 0x5e6466012380] 264 - core 163 r3060 5db6aa6 - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=15 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=18.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to '/src/res_path/output.mp4':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
encoder         : Lavf58.76.100
Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 512x896, q=2-31, 25 fps, 12800 tbn (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
encoder         : Lavc58.134.100 libx264
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s
Metadata:
encoder         : Lavc58.134.100 aac
frame=    1 fps=0.0 q=0.0 size=       0kB time=00:00:00.00 bitrate=N/A speed=   0x
frame=  158 fps=0.0 q=23.0 size=     768kB time=00:00:03.84 bitrate=1638.5kbits/s speed=7.38x
frame=  284 fps=275 q=23.0 size=    1536kB time=00:00:08.88 bitrate=1417.0kbits/s speed=8.59x
frame=  371 fps=243 q=-1.0 Lsize=    2834kB time=00:00:14.83 bitrate=1564.8kbits/s speed=9.72x
video:2587kB audio:234kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.479172%
[libx264 @ 0x5e6466012380] frame I:2     Avg QP:15.30  size: 76896
[libx264 @ 0x5e6466012380] frame P:100   Avg QP:17.22  size: 19413
[libx264 @ 0x5e6466012380] frame B:269   Avg QP:23.10  size:  2057
[libx264 @ 0x5e6466012380] consecutive B-frames:  1.1%  5.9%  2.4% 90.6%
[libx264 @ 0x5e6466012380] mb I  I16..4:  1.6% 64.5% 33.8%
[libx264 @ 0x5e6466012380] mb P  I16..4:  0.0%  0.7%  0.4%  P16..4: 31.4% 28.0% 20.0%  0.0%  0.0%    skip:19.4%
[libx264 @ 0x5e6466012380] mb B  I16..4:  0.0%  0.0%  0.0%  B16..8: 35.5%  4.3%  1.1%  direct: 1.1%  skip:58.0%  L0:35.6% L1:53.4% BI:11.0%
[libx264 @ 0x5e6466012380] 8x8 transform intra:63.6% inter:59.4%
[libx264 @ 0x5e6466012380] coded y,uvDC,uvAC intra: 93.5% 93.0% 69.6% inter: 15.9% 13.3% 0.6%
[libx264 @ 0x5e6466012380] i16 v,h,dc,p:  7% 17% 23% 53%
[libx264 @ 0x5e6466012380] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 18% 12% 10%  7% 10% 11%  9% 14%  9%
[libx264 @ 0x5e6466012380] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 17% 11%  7%  6% 17% 13% 12% 10%  7%
[libx264 @ 0x5e6466012380] i8c dc,h,v,p: 41% 21% 24% 14%
[libx264 @ 0x5e6466012380] Weighted P-Frames: Y:13.0% UV:6.0%
[libx264 @ 0x5e6466012380] ref P L0: 61.9% 21.7% 13.1%  3.0%  0.3%
[libx264 @ 0x5e6466012380] ref B L0: 94.9%  4.4%  0.7%
[libx264 @ 0x5e6466012380] ref B L1: 97.8%  2.2%
[libx264 @ 0x5e6466012380] kb/s:1427.75
[aac @ 0x5e6466014240] Qavg: 186.558
Video generation complete
Total prediction time: 395.92 seconds
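The figures in the log above can be cross-checked with a little arithmetic: the diffusion loop averages 13.61 s per step over 25 steps, and ffmpeg reports 371 frames at 25 fps. The sketch below reproduces those numbers. The linear scaling of runtime with inference_steps is an assumption for rough planning, not something this page states, and the function names are hypothetical.

```python
# Values read from the example execution log above.
SECONDS_PER_STEP = 13.61      # final tqdm average for the 25-step loop
LOGGED_STEPS = 25
LOGGED_FRAMES = 371           # last "frame=" value reported by ffmpeg
FPS = 25
TOTAL_PREDICTION_SECONDS = 395.92


def diffusion_seconds(steps: int, seconds_per_step: float = SECONDS_PER_STEP) -> float:
    """Estimate the diffusion loop time, ASSUMING cost grows linearly with steps."""
    return steps * seconds_per_step


def clip_seconds(frames: int, fps: int = FPS) -> float:
    """Length of the generated clip in seconds."""
    return frames / fps


loop_time = diffusion_seconds(LOGGED_STEPS)             # about 340 s, matching 05:40
clip_length = clip_seconds(LOGGED_FRAMES)               # matches the 14.84 s duration
loop_share = loop_time / TOTAL_PREDICTION_SECONDS       # fraction spent in diffusion

print(f"diffusion loop: {loop_time:.2f} s")
print(f"clip length:    {clip_length:.2f} s")
print(f"loop share:     {loop_share:.0%} of prediction time")
```

Under that assumption, most of the prediction time for this example is spent in the diffusion loop, so lowering inference_steps is the most direct way to shorten a run, at a possible cost in quality.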
Version Details
Version ID
a8b068b27f183677c4dbd0f58574cbc07bb7f496cd1b3750bcd37586f1421abd
Version Created
June 7, 2025