zf-kbot/sonic 🖼️🔢✓ → 🖼️

▶️ 147.0K runs 📅 Aug 2025 ⚙️ Cog 0.12.0
lipsync talking-head

About

Transform photos into lifelike talking animations with our AI Talking Photo Generator. Suitable for any channel. Create talking heads with AI!

Example Output

Performance Metrics

151.90s Prediction Time
151.91s Total Time
All Input Parameters
{
  "audio": "https://replicate.delivery/pbxt/O9gW0wwmiEzPPGd73rv7I1AH5MfblycQleoy9R7ICagvjvL1/examples_wav_talk_male_law_10s.wav",
  "image": "https://replicate.delivery/pbxt/O9gW0Pw3NhRjMXkTHnJ9dzZFpU95B9eN0KTM54gpW1NxosEH/10532f78-439f-49f1-9d6f-b92fcf83f12a.png",
  "dynamic_scale": 1,
  "min_resolution": 512,
  "inference_steps": 25,
  "keep_resolution": false
}
Input Parameters
audio (required) Type: string
Input audio file (WAV, MP3, etc.) for the voice.
image (required) Type: string
Input portrait image (will be cropped if face is detected).
dynamic_scale Type: number, Default: 1, Range: 0.5 - 2
Controls movement intensity. Increase/decrease for more/less movement.
min_resolution Type: integer, Default: 512, Range: 256 - 1024
Minimum image resolution for processing. Lower values use less memory but may reduce quality.
inference_steps Type: integer, Default: 25, Range: 5 - 50
Number of diffusion steps. Higher values may improve quality but take longer.
keep_resolution Type: boolean, Default: false
If true, output video matches the original image resolution. Otherwise uses the min_resolution after cropping.
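
The documented ranges can be checked on the client before a prediction is submitted. The following is a minimal sketch, not the model's own validation code: `build_sonic_input` and `run_sonic` are hypothetical helper names, and `run_sonic` assumes the third-party `replicate` Python client is installed with `REPLICATE_API_TOKEN` set in the environment.

```python
from typing import Any, Dict

# Version ID as listed under "Version Details" on this page.
SONIC_VERSION = "c6d80220ce71d8df04d5dbf2b189b70b9f4937aea6a030de12cb46951b24d134"


def build_sonic_input(
    audio: str,
    image: str,
    dynamic_scale: float = 1.0,
    min_resolution: int = 512,
    inference_steps: int = 25,
    keep_resolution: bool = False,
) -> Dict[str, Any]:
    """Build the input payload, enforcing the ranges documented above."""
    if not audio or not image:
        raise ValueError("audio and image are both required")
    if not 0.5 <= dynamic_scale <= 2:
        raise ValueError("dynamic_scale must be within 0.5 - 2")
    if not 256 <= min_resolution <= 1024:
        raise ValueError("min_resolution must be within 256 - 1024")
    if not 5 <= inference_steps <= 50:
        raise ValueError("inference_steps must be within 5 - 50")
    return {
        "audio": audio,
        "image": image,
        "dynamic_scale": dynamic_scale,
        "min_resolution": min_resolution,
        "inference_steps": inference_steps,
        "keep_resolution": keep_resolution,
    }


def run_sonic(model_input: Dict[str, Any]):
    """Submit the payload (hypothetical helper; needs network access and a token)."""
    import replicate  # imported lazily so the validation above works without it

    # Depending on the client version, the result may be a URL string
    # or a file-like object wrapping that URL.
    return replicate.run(f"zf-kbot/sonic:{SONIC_VERSION}", input=model_input)
```

Calling `build_sonic_input("voice.wav", "portrait.png")` with no other arguments reproduces the defaults listed above; `voice.wav` and `portrait.png` are placeholder values, and in practice these would be URLs or uploaded files.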
Output Schema

Output

Type: string, Format: uri
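
Because the output is a single URI string, saving the result locally is a plain HTTP download. The sketch below uses only the Python standard library; `filename_from_uri` and `download_output` are hypothetical helper names, and the `output.mp4` fallback is an assumption based on the file name that appears in the execution logs on this page.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_uri(uri: str, default: str = "output.mp4") -> str:
    """Take the last path segment of the URI, ignoring any query string."""
    name = posixpath.basename(urlparse(uri).path)
    return name or default


def download_output(uri: str, dest_dir: str = ".") -> str:
    """Download the file behind the output URI and return the local path."""
    local_path = os.path.join(dest_dir, filename_from_uri(uri))
    urllib.request.urlretrieve(uri, local_path)
    return local_path
```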

Example Execution Logs
Starting prediction...
Saved input image to: /src/tmp_path/input_image.png
Converted and saved audio to: /src/tmp_path/input_audio.wav
Preprocessing image...
/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Face detection result: 1 face(s) found
Generating talking face animation...
  0%|          | 0/125 [00:00<?, ?it/s]
100%|██████████| 125/125 [00:00<00:00, 1670.30it/s]
  0%|          | 0/25 [00:00<?, ?it/s]
  4%|▍         | 1/25 [00:04<01:42,  4.29s/it]
  8%|▊         | 2/25 [00:08<01:44,  4.52s/it]
 12%|█▏        | 3/25 [00:13<01:41,  4.61s/it]
 16%|█▌        | 4/25 [00:18<01:37,  4.65s/it]
 20%|██        | 5/25 [00:23<01:33,  4.67s/it]
 24%|██▍       | 6/25 [00:27<01:29,  4.69s/it]
 28%|██▊       | 7/25 [00:32<01:24,  4.70s/it]
 32%|███▏      | 8/25 [00:37<01:19,  4.71s/it]
 36%|███▌      | 9/25 [00:41<01:15,  4.71s/it]
 40%|████      | 10/25 [00:46<01:10,  4.71s/it]
 44%|████▍     | 11/25 [00:51<01:06,  4.72s/it]
 48%|████▊     | 12/25 [00:56<01:01,  4.72s/it]
 52%|█████▏    | 13/25 [01:00<00:56,  4.72s/it]
 56%|█████▌    | 14/25 [01:05<00:51,  4.71s/it]
 60%|██████    | 15/25 [01:10<00:47,  4.71s/it]
 64%|██████▍   | 16/25 [01:14<00:42,  4.71s/it]
 68%|██████▊   | 17/25 [01:19<00:37,  4.71s/it]
 72%|███████▏  | 18/25 [01:24<00:32,  4.71s/it]
 76%|███████▌  | 19/25 [01:29<00:28,  4.71s/it]
 80%|████████  | 20/25 [01:33<00:23,  4.71s/it]
 84%|████████▍ | 21/25 [01:38<00:18,  4.71s/it]
 88%|████████▊ | 22/25 [01:43<00:14,  4.71s/it]
 92%|█████████▏| 23/25 [01:47<00:09,  4.71s/it]
 96%|█████████▌| 24/25 [01:52<00:04,  4.71s/it]
100%|██████████| 25/25 [01:57<00:00,  4.71s/it]
100%|██████████| 25/25 [01:57<00:00,  4.69s/it]
  0% 0/124 [00:00<?, ?it/s]
  5% 6/124 [00:00<00:02, 58.22it/s]
 15% 19/124 [00:00<00:01, 95.60it/s]
 26% 32/124 [00:00<00:00, 107.62it/s]
 36% 45/124 [00:00<00:00, 113.23it/s]
 47% 58/124 [00:00<00:00, 116.43it/s]
 57% 71/124 [00:00<00:00, 118.48it/s]
 68% 84/124 [00:00<00:00, 119.47it/s]
 78% 97/124 [00:00<00:00, 120.61it/s]
 89% 110/124 [00:00<00:00, 121.32it/s]
 99% 123/124 [00:01<00:00, 121.77it/s]
100% 124/124 [00:01<00:00, 115.91it/s]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 70.100 / 56. 70.100
libavcodec     58.134.100 / 58.134.100
libavformat    58. 76.100 / 58. 76.100
libavdevice    58. 13.100 / 58. 13.100
libavfilter     7.110.100 /  7.110.100
libswscale      5.  9.100 /  5.  9.100
libswresample   3.  9.100 /  3.  9.100
libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/src/res_path/output_noaudio.mp4':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
encoder         : Lavf58.29.100
Duration: 00:00:09.96, start: 0.000000, bitrate: 386 kb/s
Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 512x512, 383 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
Guessed Channel Layout for Input Stream #1.0 : stereo
Input #1, wav, from '/src/tmp_path/input_audio.wav':
Duration: 00:00:10.00, bitrate: 1411 kb/s
Stream #1:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))
Stream #1:0 -> #0:1 (pcm_s16le (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 0x60a8fb80b7c0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x60a8fb80b7c0] profile High, level 3.0, 4:2:0, 8-bit
[libx264 @ 0x60a8fb80b7c0] 264 - core 163 r3060 5db6aa6 - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=15 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=18.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to '/src/res_path/output.mp4':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
encoder         : Lavf58.76.100
Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 512x512, q=2-31, 25 fps, 12800 tbn (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
encoder         : Lavc58.134.100 libx264
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s
Metadata:
encoder         : Lavc58.134.100 aac
frame=    1 fps=0.0 q=0.0 size=       0kB time=00:00:00.00 bitrate=N/A speed=   0x
frame=  214 fps=0.0 q=23.0 size=     512kB time=00:00:06.08 bitrate= 689.5kbits/s speed=11.7x
frame=  249 fps=0.0 q=-1.0 Lsize=     899kB time=00:00:09.93 bitrate= 741.1kbits/s speed=13.5x
video:731kB audio:158kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.070527%
[libx264 @ 0x60a8fb80b7c0] frame I:1     Avg QP:16.20  size: 24156
[libx264 @ 0x60a8fb80b7c0] frame P:68    Avg QP:15.98  size:  7920
[libx264 @ 0x60a8fb80b7c0] frame B:180   Avg QP:20.68  size:  1031
[libx264 @ 0x60a8fb80b7c0] consecutive B-frames:  0.8%  8.0%  1.2% 90.0%
[libx264 @ 0x60a8fb80b7c0] mb I  I16..4:  5.2% 82.7% 12.1%
[libx264 @ 0x60a8fb80b7c0] mb P  I16..4:  0.3%  1.1%  0.1%  P16..4: 40.0% 29.0% 16.4%  0.0%  0.0%    skip:13.0%
[libx264 @ 0x60a8fb80b7c0] mb B  I16..4:  0.0%  0.0%  0.0%  B16..8: 44.9%  2.9%  0.4%  direct: 0.7%  skip:51.1%  L0:43.0% L1:50.4% BI: 6.6%
[libx264 @ 0x60a8fb80b7c0] 8x8 transform intra:75.3% inter:69.3%
[libx264 @ 0x60a8fb80b7c0] coded y,uvDC,uvAC intra: 68.9% 84.8% 61.8% inter: 13.5% 19.7% 2.1%
[libx264 @ 0x60a8fb80b7c0] i16 v,h,dc,p: 36% 10% 22% 31%
[libx264 @ 0x60a8fb80b7c0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 19% 15% 20%  6%  7% 10%  7%  9%  7%
[libx264 @ 0x60a8fb80b7c0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 23% 21% 10%  6% 11% 10%  9%  5%  6%
[libx264 @ 0x60a8fb80b7c0] i8c dc,h,v,p: 42% 20% 26% 13%
[libx264 @ 0x60a8fb80b7c0] Weighted P-Frames: Y:10.3% UV:8.8%
[libx264 @ 0x60a8fb80b7c0] ref P L0: 59.7% 19.3% 15.8%  5.0%  0.2%
[libx264 @ 0x60a8fb80b7c0] ref B L0: 92.9%  5.9%  1.1%
[libx264 @ 0x60a8fb80b7c0] ref B L1: 97.1%  2.9%
[libx264 @ 0x60a8fb80b7c0] kb/s:601.07
[aac @ 0x60a8fb80d200] Qavg: 1053.562
Video generation complete
Total prediction time: 151.26 seconds
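
To see where the time goes in a run like this one, the log text can be summarized with two regular expressions. This is a rough sketch that assumes the log keeps the tqdm progress format and the "Total prediction time" line shown above; `summarize_log` is a hypothetical helper, not part of the model or of Replicate.

```python
import re
from typing import Any, Dict, Optional

# Matches e.g. "Total prediction time: 151.26 seconds"
TOTAL_RE = re.compile(r"Total prediction time:\s*([\d.]+)\s*seconds")
# Matches tqdm lines reported in s/it, e.g. "25/25 [01:57<00:00,  4.69s/it]"
STEP_RE = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<[^,]*,\s*([\d.]+)s/it\]")


def summarize_log(log_text: str) -> Dict[str, Any]:
    """Extract the total time and the final diffusion-loop progress line."""
    total: Optional[float] = None
    diffusion: Optional[Dict[str, Any]] = None
    for line in log_text.splitlines():
        total_match = TOTAL_RE.search(line)
        if total_match:
            total = float(total_match.group(1))
        step_match = STEP_RE.search(line)
        # Only keep completed bars (done == total); the last one wins.
        if step_match and step_match.group(1) == step_match.group(2):
            diffusion = {
                "steps": int(step_match.group(2)),
                "elapsed_s": int(step_match.group(3)) * 60 + int(step_match.group(4)),
                "s_per_it": float(step_match.group(5)),
            }
    return {"total_s": total, "diffusion": diffusion}
```

Applied to the log above, this picks up the 25-step loop at 01:57 (117 s) out of 151.26 s total, so the diffusion steps account for roughly three quarters of this particular run.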
Version Details
Version ID
c6d80220ce71d8df04d5dbf2b189b70b9f4937aea6a030de12cb46951b24d134
Version Created
August 5, 2025