zsxkib/thinksound 📝🔢🖼️ → 🖼️

▶️ 6.2K runs 📅 Jul 2025 ⚙️ Cog 0.15.9 🔗 GitHub 📄 Paper ⚖️ License
foley sound-design sound-effect-generation video-to-audio

About

Generate contextual audio from video using step-by-step reasoning 🎶

Example Output

Performance Metrics

7.05s Prediction Time
211.89s Total Time
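
The two figures above differ by a wide margin, and a quick calculation makes the gap explicit. This is a minimal sketch using only the numbers reported on this page; the page does not say what the remaining time covers, so reading it as queueing or model start-up is an assumption.

```python
# Timings copied from the Performance Metrics above (seconds).
prediction_time_s = 7.05
total_time_s = 211.89

# Time not spent in the prediction itself (assumed: queueing and start-up).
overhead_s = round(total_time_s - prediction_time_s, 2)

# Fraction of the total wall-clock time spent actually predicting.
prediction_share = prediction_time_s / total_time_s

print(f"overhead: {overhead_s} s, prediction share: {prediction_share:.1%}")
```

For a warm model the total time may be much closer to the prediction time, so this single example should not be treated as typical.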
All Input Parameters
{
  "cot": "Begin with the sound of hands scooping up loose plastic debris, followed by the subtle cascading noise as the pieces fall and scatter back down. Include soft crinkling and rustling to emphasize the texture of the plastic. Add ambient factory background noise with distant machinery to create an industrial atmosphere.",
  "video": "https://replicate.delivery/pbxt/NKmVPTu8dfrIHlOCh3QMOs12SuHpYhMDA7hLvxcqYwqJyU7P/replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4",
  "caption": "Plastic Debris Handling",
  "cfg_scale": 5,
  "num_inference_steps": 24
}
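
The input parameters above map directly onto a client call. The following is a minimal sketch, assuming the `replicate` Python package is installed and `REPLICATE_API_TOKEN` is set in the environment. `build_input` and `run_thinksound` are hypothetical helper names, and the video URL is a placeholder for your own file.

```python
# Sketch only: helper names are illustrative, not part of the model's API.
MODEL_REF = (
    "zsxkib/thinksound:"
    "40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4"
)


def build_input(video, caption="", cot="", cfg_scale=5,
                num_inference_steps=24, seed=None):
    """Assemble the input dict using the defaults listed on this page."""
    payload = {
        "video": video,
        "caption": caption,
        "cot": cot,
        "cfg_scale": cfg_scale,
        "num_inference_steps": num_inference_steps,
    }
    if seed is not None:  # omit the key so the model picks a random seed
        payload["seed"] = seed
    return payload


def run_thinksound(payload):
    """Submit the prediction; needs `pip install replicate` and an API token."""
    import replicate  # imported lazily so the helpers above work without it

    return replicate.run(MODEL_REF, input=payload)


if __name__ == "__main__":
    example = build_input(
        "https://example.com/silent_clip.mp4",  # placeholder URL
        caption="Plastic Debris Handling",
        cot="Hands scooping loose plastic debris with factory ambience.",
    )
    print(example)
```

Calling `run_thinksound(example)` would submit the job; it is left out of the `__main__` block so the sketch runs without network access.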
Input Parameters
cot Type: string • Default: (empty)
Chain-of-Thought description providing detailed reasoning about the desired audio (optional)
seed Type: integer
Random seed for reproducible outputs. Leave empty to use a random seed
video (required) Type: string
Input video file (supports various formats)
caption Type: string • Default: (empty)
Caption/title describing the video content (optional)
cfg_scale Type: number • Default: 5 • Range: 1 - 20
Classifier-free guidance scale. Higher values follow conditioning more closely but may reduce creativity
num_inference_steps Type: integer • Default: 24 • Range: 10 - 100
Number of diffusion denoising steps. More steps generally give higher quality but slower generation
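
The ranges listed above can be checked locally before a request is sent. This is a minimal sketch; `validate_input` is a hypothetical helper, and the server may apply additional checks that are not documented here.

```python
def validate_input(payload):
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []

    # `video` is the only required parameter.
    if not payload.get("video"):
        problems.append("video is required")

    # cfg_scale: number, default 5, range 1 - 20.
    cfg_scale = payload.get("cfg_scale", 5)
    if not isinstance(cfg_scale, (int, float)) or not 1 <= cfg_scale <= 20:
        problems.append("cfg_scale must be a number between 1 and 20")

    # num_inference_steps: integer, default 24, range 10 - 100.
    steps = payload.get("num_inference_steps", 24)
    if not isinstance(steps, int) or not 10 <= steps <= 100:
        problems.append("num_inference_steps must be an integer between 10 and 100")

    # seed: optional integer; omit it to get a random seed.
    seed = payload.get("seed")
    if seed is not None and not isinstance(seed, int):
        problems.append("seed must be an integer when provided")

    return problems


print(validate_input({"video": "https://example.com/clip.mp4", "cfg_scale": 25}))
```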
Output Schema

Output

Type: string • Format: uri
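
Because the output is a URI string, the result has to be fetched separately. The following is a minimal sketch using only the standard library; `local_name_for` and `save_output` are hypothetical helpers, and the `output.mp4` fallback name is an assumption. Some client library versions may return a file-like object rather than a plain string, which this sketch does not handle.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse


def local_name_for(uri, default="output.mp4"):
    """Derive a local filename from the last path segment of the URI."""
    name = posixpath.basename(urlparse(uri).path)
    return name or default


def save_output(uri, dest=None):
    """Download the file behind the output URI and return the local path."""
    dest = dest or local_name_for(uri)
    urllib.request.urlretrieve(uri, dest)
    return dest


# Filename derivation only; no network access is needed for this line.
print(local_name_for("https://example.com/results/clip_with_audio.mp4"))
```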

Example Execution Logs
2025-07-10 15:37:16,299 - INFO - 🎬 Starting ThinkSound audio generation...
Seed set to 187100619
2025-07-10 15:37:16,300 - INFO - 🎲 Using seed: 187100619
2025-07-10 15:37:16,395 - INFO - 📹 Video duration: 9.10 seconds
2025-07-10 15:37:16,395 - INFO - 🔍 Extracting multi-modal features...
2025-07-10 15:37:16,843 - INFO - Processing video: /tmp/tmpe8npov33replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4
2025-07-10 15:37:20,360 - INFO - Processing text features...
2025-07-10 15:37:20,501 - INFO - Processing video features...
2025-07-10 15:37:22,049 - INFO - 📊 Extracted features - CLIP: torch.Size([72, 1024]), Sync: torch.Size([216, 768])
2025-07-10 15:37:22,049 - INFO - 🎵 Generating audio with diffusion model...
2025-07-10 15:37:22,057 - INFO - Running 24 diffusion steps with CFG scale 5.0...
0it [00:00, ?it/s]
1it [00:00,  9.51it/s]
6it [00:00, 29.39it/s]
10it [00:00, 33.50it/s]
14it [00:00, 35.53it/s]
18it [00:00, 36.68it/s]
22it [00:00, 37.39it/s]
24it [00:00, 34.82it/s]
2025-07-10 15:37:22,849 - INFO - 🎞️ Combining audio with video...
2025-07-10 15:37:23,171 - INFO - Video and audio combined successfully
2025-07-10 15:37:23,171 - INFO - ✅ Audio generation completed successfully!
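
The timestamps in these logs can be used to measure how long the run took end to end. This is a minimal sketch, assuming every timestamped line begins with a 23-character `YYYY-MM-DD HH:MM:SS,mmm` prefix as in the logs above; `parse_timestamp` and `elapsed_seconds` are hypothetical helpers, and the sample lines are abbreviated copies of the log with the emoji omitted.

```python
from datetime import datetime

LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
PREFIX_LEN = 23  # length of "2025-07-10 15:37:16,299"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:PREFIX_LEN], LOG_FORMAT)
    except ValueError:
        return None  # e.g. progress-bar lines or "Seed set to ..." lines


def elapsed_seconds(lines):
    """Seconds between the first and last timestamped lines."""
    stamps = [t for t in map(parse_timestamp, lines) if t is not None]
    return (stamps[-1] - stamps[0]).total_seconds()


sample = [
    "2025-07-10 15:37:16,299 - INFO - Starting ThinkSound audio generation...",
    "Seed set to 187100619",
    "2025-07-10 15:37:22,057 - INFO - Running 24 diffusion steps with CFG scale 5.0...",
    "2025-07-10 15:37:23,171 - INFO - Audio generation completed successfully!",
]
print(elapsed_seconds(sample))
```

The same helper applied to adjacent lines gives per-stage durations, for example the time spent in feature extraction versus the diffusion loop.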
Version Details
Version ID
40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4
Version Created
July 9, 2025
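
The version ID above can be used to pin requests to this exact build. The following is a minimal sketch of constructing (not sending) a request for Replicate's HTTP predictions endpoint with the standard library. `build_request` is a hypothetical helper, the token and video URL are placeholders, and the endpoint and body shape should be checked against Replicate's current API documentation.

```python
import json
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"
VERSION_ID = "40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4"


def build_request(token, model_input):
    """Build a POST request that pins the prediction to VERSION_ID."""
    body = json.dumps({"version": VERSION_ID, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder token and URL; nothing is sent over the network here.
req = build_request("YOUR_API_TOKEN", {"video": "https://example.com/silent_clip.mp4"})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would submit the prediction; pinning the version keeps results comparable if the model is updated later.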