zsxkib/thinksound 📝🔢🖼️ → 🖼️

▶️ 9.4K runs 📅 Jul 2025 ⚙️ Cog 0.15.9 🔗 GitHub 📄 Paper ⚖️ License

foley sound-design sound-effect-generation video-to-audio

Performance

7.0s Typical run time

~212s Cold start (first call)

9.4K Total runs

About

Generate contextual audio from video using step-by-step reasoning 🎶

Example Output

Performance Metrics

7.05s Prediction Time

211.89s Total Time

All Input Parameters

{
  "cot": "Begin with the sound of hands scooping up loose plastic debris, followed by the subtle cascading noise as the pieces fall and scatter back down. Include soft crinkling and rustling to emphasize the texture of the plastic. Add ambient factory background noise with distant machinery to create an industrial atmosphere.",
  "video": "https://replicate.delivery/pbxt/NKmVPTu8dfrIHlOCh3QMOs12SuHpYhMDA7hLvxcqYwqJyU7P/replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4",
  "caption": "Plastic Debris Handling",
  "cfg_scale": 5,
  "num_inference_steps": 24
}
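The parameters above could be sent to the model roughly as follows. This is a minimal sketch, assuming the third-party `replicate` Python client is installed and a `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `run_thinksound` are hypothetical, and the return type of `replicate.run` (a URL string or a file-like object) depends on the client version.

```python
# Model reference built from the owner/name and version ID listed on this page.
MODEL = "zsxkib/thinksound"
VERSION = "40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4"
MODEL_REF = f"{MODEL}:{VERSION}"


def build_input(video, caption="", cot="", cfg_scale=5,
                num_inference_steps=24, seed=None):
    """Assemble the input payload (hypothetical helper, not part of the model)."""
    payload = {
        "video": video,
        "caption": caption,
        "cot": cot,
        "cfg_scale": cfg_scale,
        "num_inference_steps": num_inference_steps,
    }
    # Omit the seed entirely so the model picks a random one.
    if seed is not None:
        payload["seed"] = seed
    return payload


def run_thinksound(payload):
    """Send the payload to Replicate; needs network access and an API token."""
    import replicate  # assumption: installed with `pip install replicate`
    return replicate.run(MODEL_REF, input=payload)


example_input = build_input(
    video="https://replicate.delivery/pbxt/NKmVPTu8dfrIHlOCh3QMOs12SuHpYhMDA7hLvxcqYwqJyU7P/replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4",
    caption="Plastic Debris Handling",
    cot="Begin with the sound of hands scooping up loose plastic debris.",
)
# output = run_thinksound(example_input)  # uncomment to make a real API call
```

Pinning the version ID keeps results tied to the exact model build shown under Version Details.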

Input Parameters

cot (string, default: empty): Optional Chain-of-Thought description providing detailed reasoning about the desired audio.
seed (integer): Random seed for reproducible outputs. Leave empty for a random seed.
video (string, required): Input video file (supports various formats).
caption (string, default: empty): Optional caption or title describing the video content.
cfg_scale (number, default: 5, range: 1 to 20): Classifier-free guidance scale. Higher values follow the conditioning more closely but may reduce creativity.
num_inference_steps (integer, default: 24, range: 10 to 100): Number of diffusion denoising steps. More steps generally give higher quality but slower generation.
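The documented defaults and ranges can be checked on the client side before a prediction is submitted. The sketch below is illustrative only: `validate_inputs` is a hypothetical helper, and the model itself may enforce these constraints differently.

```python
def validate_inputs(inputs):
    """Apply documented defaults and range checks (hypothetical helper)."""
    if not inputs.get("video"):
        raise ValueError("video is required")

    cfg_scale = inputs.get("cfg_scale", 5)
    if isinstance(cfg_scale, bool) or not isinstance(cfg_scale, (int, float)):
        raise ValueError("cfg_scale must be a number")
    if not 1 <= cfg_scale <= 20:
        raise ValueError("cfg_scale must be between 1 and 20")

    steps = inputs.get("num_inference_steps", 24)
    if isinstance(steps, bool) or not isinstance(steps, int):
        raise ValueError("num_inference_steps must be an integer")
    if not 10 <= steps <= 100:
        raise ValueError("num_inference_steps must be between 10 and 100")

    cleaned = {
        "video": inputs["video"],
        "caption": inputs.get("caption", ""),
        "cot": inputs.get("cot", ""),
        "cfg_scale": cfg_scale,
        "num_inference_steps": steps,
    }

    # seed is optional; leaving it out requests a random seed.
    seed = inputs.get("seed")
    if seed is not None:
        if isinstance(seed, bool) or not isinstance(seed, int):
            raise ValueError("seed must be an integer")
        cleaned["seed"] = seed
    return cleaned
```

Failing early on an out-of-range value avoids paying for a cold start only to have the prediction rejected.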

Output Schema

Output

Type: string • Format: uri

Example Execution Logs

2025-07-10 15:37:16,299 - INFO - 🎬 Starting ThinkSound audio generation...
Seed set to 187100619
2025-07-10 15:37:16,300 - INFO - 🎲 Using seed: 187100619
2025-07-10 15:37:16,395 - INFO - 📹 Video duration: 9.10 seconds
2025-07-10 15:37:16,395 - INFO - 🔍 Extracting multi-modal features...
2025-07-10 15:37:16,843 - INFO - Processing video: /tmp/tmpe8npov33replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4
2025-07-10 15:37:20,360 - INFO - Processing text features...
2025-07-10 15:37:20,501 - INFO - Processing video features...
2025-07-10 15:37:22,049 - INFO - 📊 Extracted features - CLIP: torch.Size([72, 1024]), Sync: torch.Size([216, 768])
2025-07-10 15:37:22,049 - INFO - 🎵 Generating audio with diffusion model...
2025-07-10 15:37:22,057 - INFO - Running 24 diffusion steps with CFG scale 5.0...
0it [00:00, ?it/s]
1it [00:00,  9.51it/s]
6it [00:00, 29.39it/s]
10it [00:00, 33.50it/s]
14it [00:00, 35.53it/s]
18it [00:00, 36.68it/s]
22it [00:00, 37.39it/s]
24it [00:00, 34.82it/s]
2025-07-10 15:37:22,849 - INFO - 🎞️ Combining audio with video...
2025-07-10 15:37:23,171 - INFO - Video and audio combined successfully
2025-07-10 15:37:23,171 - INFO - ✅ Audio generation completed successfully!
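The timestamps in these logs show where the run time goes. The following sketch, using only the standard library, parses timestamped lines and reports the gap between consecutive events. The function names are hypothetical, and the sample is a subset of the log lines above.

```python
import re
from datetime import datetime

# Matches lines such as "2025-07-10 15:37:16,299 - INFO - message".
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (\w+) - (.*)$"
)


def parse_events(log_text):
    """Return (timestamp, message) pairs, skipping lines without a timestamp."""
    events = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, match.group(3)))
    return events


def stage_durations(events):
    """Seconds elapsed between each event and the one that follows it."""
    return [
        (msg, (next_ts - ts).total_seconds())
        for (ts, msg), (next_ts, _) in zip(events, events[1:])
    ]


SAMPLE = """\
2025-07-10 15:37:16,299 - INFO - 🎬 Starting ThinkSound audio generation...
Seed set to 187100619
2025-07-10 15:37:16,395 - INFO - 🔍 Extracting multi-modal features...
2025-07-10 15:37:22,049 - INFO - 🎵 Generating audio with diffusion model...
2025-07-10 15:37:22,849 - INFO - 🎞️ Combining audio with video...
2025-07-10 15:37:23,171 - INFO - ✅ Audio generation completed successfully!
"""

events = parse_events(SAMPLE)
for message, seconds in stage_durations(events):
    print(f"{seconds:6.3f}s  {message}")

total = (events[-1][0] - events[0][0]).total_seconds()
print(f"{total:6.3f}s  total")
```

On this sample, feature extraction accounts for about 5.65 s of the roughly 6.87 s between the first and last log lines, while the 24 diffusion steps take under a second.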

Version Details

Version ID: 40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4
Version Created: July 9, 2025
