zf-kbot/sonic 🖼️🔢✓ → 🖼️
About
Transform photos into lifelike talking animations with our AI Talking Photo Generator. Perfect for any channel. Create talking heads with AI!
Example Output
Output
Performance Metrics
- Prediction Time: 151.90s
- Total Time: 151.91s
All Input Parameters
{
  "audio": "https://replicate.delivery/pbxt/O9gW0wwmiEzPPGd73rv7I1AH5MfblycQleoy9R7ICagvjvL1/examples_wav_talk_male_law_10s.wav",
  "image": "https://replicate.delivery/pbxt/O9gW0Pw3NhRjMXkTHnJ9dzZFpU95B9eN0KTM54gpW1NxosEH/10532f78-439f-49f1-9d6f-b92fcf83f12a.png",
  "dynamic_scale": 1,
  "min_resolution": 512,
  "inference_steps": 25,
  "keep_resolution": false
}
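The parameters above can also be submitted programmatically. The following is a minimal sketch, assuming the `replicate` Python client is installed and a `REPLICATE_API_TOKEN` environment variable is set. `build_input` and `run_sonic` are hypothetical helper names, the keyword defaults simply mirror the example values on this page (they are not documented defaults), and the version hash is the one listed under Version Details.

```python
import json

MODEL = "zf-kbot/sonic"
# Version hash copied from the Version Details section of this page.
VERSION = "c6d80220ce71d8df04d5dbf2b189b70b9f4937aea6a030de12cb46951b24d134"


def build_input(audio_url, image_url, dynamic_scale=1, min_resolution=512,
                inference_steps=25, keep_resolution=False):
    """Assemble the input payload (hypothetical helper).

    Defaults mirror the example run above; they are an assumption,
    not the model's documented defaults.
    """
    return {
        "audio": audio_url,
        "image": image_url,
        "dynamic_scale": dynamic_scale,
        "min_resolution": min_resolution,
        "inference_steps": inference_steps,
        "keep_resolution": keep_resolution,
    }


def run_sonic(payload):
    """Submit the payload (hypothetical helper; needs network access and a token)."""
    import replicate  # imported lazily so the payload can be built without the client
    return replicate.run(f"{MODEL}:{VERSION}", input=payload)


if __name__ == "__main__":
    payload = build_input(
        "https://replicate.delivery/pbxt/O9gW0wwmiEzPPGd73rv7I1AH5MfblycQleoy9R7ICagvjvL1/examples_wav_talk_male_law_10s.wav",
        "https://replicate.delivery/pbxt/O9gW0Pw3NhRjMXkTHnJ9dzZFpU95B9eN0KTM54gpW1NxosEH/10532f78-439f-49f1-9d6f-b92fcf83f12a.png",
    )
    print(json.dumps(payload, indent=2))
    # result = run_sonic(payload)  # uncomment to run a real prediction
```

Keeping payload construction separate from submission makes it possible to inspect or log the request before spending roughly two and a half minutes on a prediction.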
Input Parameters
- audio (required): Input audio file (WAV, MP3, etc.) for the voice.
- image (required): Input portrait image (will be cropped if a face is detected).
- dynamic_scale: Controls movement intensity. Increase for more movement, decrease for less.
- min_resolution: Minimum image resolution for processing. Lower values use less memory but may reduce quality.
- inference_steps: Number of diffusion steps. Higher values may improve quality but take longer.
- keep_resolution: If true, the output video matches the original image resolution. Otherwise, it uses min_resolution after cropping.
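A small pre-flight check can catch malformed requests before a prediction is submitted. The sketch below is illustrative only: `validate_input` is a hypothetical helper, and the type constraints are inferred from the example values on this page (a number for `dynamic_scale`, positive integers for `min_resolution` and `inference_steps`, a boolean for `keep_resolution`), not taken from a published schema.

```python
def validate_input(params):
    """Return a list of problems found in a parameter dict (empty list means OK).

    The rules here are assumptions inferred from the example input,
    not the model's official schema.
    """
    problems = []

    # audio and image are the only parameters marked as required.
    for key in ("audio", "image"):
        value = params.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key} is required and must be a non-empty string")

    # bool is a subclass of int in Python, so it is excluded explicitly.
    scale = params.get("dynamic_scale", 1)
    if isinstance(scale, bool) or not isinstance(scale, (int, float)):
        problems.append("dynamic_scale must be a number")

    for key in ("min_resolution", "inference_steps"):
        value = params.get(key, 1)
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")

    if not isinstance(params.get("keep_resolution", False), bool):
        problems.append("keep_resolution must be a boolean")

    return problems


if __name__ == "__main__":
    print(validate_input({"image": "portrait.png", "inference_steps": 0}))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once.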
Output Schema
Output
Example Execution Logs
Starting prediction...
Saved input image to: /src/tmp_path/input_image.png
Converted and saved audio to: /src/tmp_path/input_audio.wav
Preprocessing image...
/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Face detection result: 1 face(s) found
Generating talking face animation...
0%| | 0/125 [00:00<?, ?it/s]
100%|██████████| 125/125 [00:00<00:00, 1670.30it/s]
0%| | 0/25 [00:00<?, ?it/s]
4%|▍ | 1/25 [00:04<01:42, 4.29s/it]
8%|▊ | 2/25 [00:08<01:44, 4.52s/it]
12%|█▏ | 3/25 [00:13<01:41, 4.61s/it]
16%|█▌ | 4/25 [00:18<01:37, 4.65s/it]
20%|██ | 5/25 [00:23<01:33, 4.67s/it]
24%|██▍ | 6/25 [00:27<01:29, 4.69s/it]
28%|██▊ | 7/25 [00:32<01:24, 4.70s/it]
32%|███▏ | 8/25 [00:37<01:19, 4.71s/it]
36%|███▌ | 9/25 [00:41<01:15, 4.71s/it]
40%|████ | 10/25 [00:46<01:10, 4.71s/it]
44%|████▍ | 11/25 [00:51<01:06, 4.72s/it]
48%|████▊ | 12/25 [00:56<01:01, 4.72s/it]
52%|█████▏ | 13/25 [01:00<00:56, 4.72s/it]
56%|█████▌ | 14/25 [01:05<00:51, 4.71s/it]
60%|██████ | 15/25 [01:10<00:47, 4.71s/it]
64%|██████▍ | 16/25 [01:14<00:42, 4.71s/it]
68%|██████▊ | 17/25 [01:19<00:37, 4.71s/it]
72%|███████▏ | 18/25 [01:24<00:32, 4.71s/it]
76%|███████▌ | 19/25 [01:29<00:28, 4.71s/it]
80%|████████ | 20/25 [01:33<00:23, 4.71s/it]
84%|████████▍ | 21/25 [01:38<00:18, 4.71s/it]
88%|████████▊ | 22/25 [01:43<00:14, 4.71s/it]
92%|█████████▏| 23/25 [01:47<00:09, 4.71s/it]
96%|█████████▌| 24/25 [01:52<00:04, 4.71s/it]
100%|██████████| 25/25 [01:57<00:00, 4.71s/it]
100%|██████████| 25/25 [01:57<00:00, 4.69s/it]
0% 0/124 [00:00<?, ?it/s]
5% 6/124 [00:00<00:02, 58.22it/s]
15% 19/124 [00:00<00:01, 95.60it/s]
26% 32/124 [00:00<00:00, 107.62it/s]
36% 45/124 [00:00<00:00, 113.23it/s]
47% 58/124 [00:00<00:00, 116.43it/s]
57% 71/124 [00:00<00:00, 118.48it/s]
68% 84/124 [00:00<00:00, 119.47it/s]
78% 97/124 [00:00<00:00, 120.61it/s]
89% 110/124 [00:00<00:00, 121.32it/s]
99% 123/124 [00:01<00:00, 121.77it/s]
100% 124/124 [00:01<00:00, 115.91it/s]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil 56. 70.100 / 56. 70.100
  libavcodec 58.134.100 / 58.134.100
  libavformat 58. 76.100 / 58. 76.100
  libavdevice 58. 13.100 / 58. 13.100
  libavfilter 7.110.100 / 7.110.100
  libswscale 5. 9.100 / 5. 9.100
  libswresample 3. 9.100 / 3. 9.100
  libpostproc 55. 9.100 / 55. 9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/src/res_path/output_noaudio.mp4':
  Metadata:
    major_brand : isom
    minor_version : 512
    compatible_brands: isomiso2avc1mp41
    encoder : Lavf58.29.100
  Duration: 00:00:09.96, start: 0.000000, bitrate: 386 kb/s
  Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 512x512, 383 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
    Metadata:
      handler_name : VideoHandler
      vendor_id : [0][0][0][0]
Guessed Channel Layout for Input Stream #1.0 : stereo
Input #1, wav, from '/src/tmp_path/input_audio.wav':
  Duration: 00:00:10.00, bitrate: 1411 kb/s
  Stream #1:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))
  Stream #1:0 -> #0:1 (pcm_s16le (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 0x60a8fb80b7c0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x60a8fb80b7c0] profile High, level 3.0, 4:2:0, 8-bit
[libx264 @ 0x60a8fb80b7c0] 264 - core 163 r3060 5db6aa6 - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=15 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=18.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to '/src/res_path/output.mp4':
  Metadata:
    major_brand : isom
    minor_version : 512
    compatible_brands: isomiso2avc1mp41
    encoder : Lavf58.76.100
  Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 512x512, q=2-31, 25 fps, 12800 tbn (default)
    Metadata:
      handler_name : VideoHandler
      vendor_id : [0][0][0][0]
      encoder : Lavc58.134.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
  Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s
    Metadata:
      encoder : Lavc58.134.100 aac
frame= 1 fps=0.0 q=0.0 size= 0kB time=00:00:00.00 bitrate=N/A speed= 0x
frame= 214 fps=0.0 q=23.0 size= 512kB time=00:00:06.08 bitrate= 689.5kbits/s speed=11.7x
frame= 249 fps=0.0 q=-1.0 Lsize= 899kB time=00:00:09.93 bitrate= 741.1kbits/s speed=13.5x
video:731kB audio:158kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.070527%
[libx264 @ 0x60a8fb80b7c0] frame I:1 Avg QP:16.20 size: 24156
[libx264 @ 0x60a8fb80b7c0] frame P:68 Avg QP:15.98 size: 7920
[libx264 @ 0x60a8fb80b7c0] frame B:180 Avg QP:20.68 size: 1031
[libx264 @ 0x60a8fb80b7c0] consecutive B-frames: 0.8% 8.0% 1.2% 90.0%
[libx264 @ 0x60a8fb80b7c0] mb I I16..4: 5.2% 82.7% 12.1%
[libx264 @ 0x60a8fb80b7c0] mb P I16..4: 0.3% 1.1% 0.1% P16..4: 40.0% 29.0% 16.4% 0.0% 0.0% skip:13.0%
[libx264 @ 0x60a8fb80b7c0] mb B I16..4: 0.0% 0.0% 0.0% B16..8: 44.9% 2.9% 0.4% direct: 0.7% skip:51.1% L0:43.0% L1:50.4% BI: 6.6%
[libx264 @ 0x60a8fb80b7c0] 8x8 transform intra:75.3% inter:69.3%
[libx264 @ 0x60a8fb80b7c0] coded y,uvDC,uvAC intra: 68.9% 84.8% 61.8% inter: 13.5% 19.7% 2.1%
[libx264 @ 0x60a8fb80b7c0] i16 v,h,dc,p: 36% 10% 22% 31%
[libx264 @ 0x60a8fb80b7c0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 19% 15% 20% 6% 7% 10% 7% 9% 7%
[libx264 @ 0x60a8fb80b7c0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 23% 21% 10% 6% 11% 10% 9% 5% 6%
[libx264 @ 0x60a8fb80b7c0] i8c dc,h,v,p: 42% 20% 26% 13%
[libx264 @ 0x60a8fb80b7c0] Weighted P-Frames: Y:10.3% UV:8.8%
[libx264 @ 0x60a8fb80b7c0] ref P L0: 59.7% 19.3% 15.8% 5.0% 0.2%
[libx264 @ 0x60a8fb80b7c0] ref B L0: 92.9% 5.9% 1.1%
[libx264 @ 0x60a8fb80b7c0] ref B L1: 97.1% 2.9%
[libx264 @ 0x60a8fb80b7c0] kb/s:601.07
[aac @ 0x60a8fb80d200] Qavg: 1053.562
Video generation complete
Total prediction time: 151.26 seconds
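The log above contains enough information to estimate where the prediction time goes. The following sketch parses three lines copied from it (the final diffusion progress bar, the last ffmpeg frame counter, and the total time) and derives the time spent in the diffusion loop and the output duration. `summarize` is a hypothetical helper, and the 25 fps default is taken from the stream info in this log, not from a documented model setting.

```python
import re

# Three lines copied verbatim from the execution log above.
LOG_EXCERPT = """\
100%|██████████| 25/25 [01:57<00:00, 4.69s/it]
frame= 249 fps=0.0 q=-1.0 Lsize= 899kB time=00:00:09.93 bitrate= 741.1kbits/s speed=13.5x
Total prediction time: 151.26 seconds
"""


def summarize(log, fps=25):
    """Derive rough timing figures from a log excerpt (hypothetical helper)."""
    # Completed tqdm bar, e.g. "25/25 [01:57<00:00, 4.69s/it]".
    step = re.search(r"(\d+)/\1 \[[^\]]*?, ([\d.]+)s/it\]", log)
    # ffmpeg prints a running frame counter; the last one is the final count.
    frames = re.findall(r"frame=\s*(\d+)", log)
    total = re.search(r"Total prediction time: ([\d.]+) seconds", log)
    if not (step and frames and total):
        raise ValueError("log excerpt is missing an expected line")

    n_steps = int(step.group(1))
    sec_per_step = float(step.group(2))
    total_seconds = float(total.group(1))
    diffusion_seconds = n_steps * sec_per_step

    return {
        "steps": n_steps,
        "diffusion_seconds": diffusion_seconds,
        "diffusion_share": diffusion_seconds / total_seconds,
        "video_seconds": int(frames[-1]) / fps,
    }


if __name__ == "__main__":
    for key, value in summarize(LOG_EXCERPT).items():
        print(f"{key}: {value}")
```

For this run, 25 steps at about 4.69 s each account for roughly 117 s of the 151.26 s total, so the diffusion loop dominates, which is consistent with the note that higher `inference_steps` values take longer. The 249 frames at 25 fps give 9.96 s of video, matching the duration ffmpeg reports for the intermediate file.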
Version Details
- Version ID: c6d80220ce71d8df04d5dbf2b189b70b9f4937aea6a030de12cb46951b24d134
- Version Created: August 5, 2025