zsxkib/thinksound
About
Generate contextual audio from video using step-by-step reasoning
Performance Metrics
- Prediction Time: 7.05s
- Total Time: 211.89s
All Input Parameters
{
  "cot": "Begin with the sound of hands scooping up loose plastic debris, followed by the subtle cascading noise as the pieces fall and scatter back down. Include soft crinkling and rustling to emphasize the texture of the plastic. Add ambient factory background noise with distant machinery to create an industrial atmosphere.",
  "video": "https://replicate.delivery/pbxt/NKmVPTu8dfrIHlOCh3QMOs12SuHpYhMDA7hLvxcqYwqJyU7P/replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4",
  "caption": "Plastic Debris Handling",
  "cfg_scale": 5,
  "num_inference_steps": 24
}
Input Parameters
- cot: Chain-of-Thought description providing detailed reasoning about the desired audio (optional)
- seed: Random seed for reproducible outputs. Leave empty to use a random seed
- video (required): Input video file (supports various formats)
- caption: Caption or title describing the video content (optional)
- cfg_scale: Classifier-free guidance scale. Higher values follow the conditioning more closely but may reduce creativity
- num_inference_steps: Number of diffusion denoising steps. More steps give higher quality but slower generation
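The parameters above can be sent to the model with the Replicate Python client. The following is a minimal sketch, not the model author's code: `build_input` is a hypothetical helper, the version hash comes from the Version Details on this page, and the call assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set. Optional fields are omitted when unset so the model's own defaults apply; those defaults are not listed on this page.

```python
import os

# Model reference in "owner/name:version" form; the hash is the Version ID from this page.
MODEL_REF = (
    "zsxkib/thinksound:"
    "40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4"
)


def build_input(video, caption=None, cot=None, seed=None,
                cfg_scale=None, num_inference_steps=None):
    """Hypothetical helper: assemble the input dict, skipping unset optional fields."""
    if not video:
        raise ValueError("video is required")
    payload = {"video": video}
    optional = {
        "caption": caption,
        "cot": cot,
        "seed": seed,
        "cfg_scale": cfg_scale,
        "num_inference_steps": num_inference_steps,
    }
    # Only send values the caller actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    import replicate  # third-party client: pip install replicate

    output = replicate.run(
        MODEL_REF,
        input=build_input(
            video="https://replicate.delivery/pbxt/NKmVPTu8dfrIHlOCh3QMOs12SuHpYhMDA7hLvxcqYwqJyU7P/replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4",
            caption="Plastic Debris Handling",
            cfg_scale=5,
            num_inference_steps=24,
        ),
    )
    print(output)
```

The shape of the returned value depends on the client version, so the sketch prints it as-is rather than assuming a URL or a file object.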
Example Execution Logs
2025-07-10 15:37:16,299 - INFO - Starting ThinkSound audio generation...
Seed set to 187100619
2025-07-10 15:37:16,300 - INFO - Using seed: 187100619
2025-07-10 15:37:16,395 - INFO - Video duration: 9.10 seconds
2025-07-10 15:37:16,395 - INFO - Extracting multi-modal features...
2025-07-10 15:37:16,843 - INFO - Processing video: /tmp/tmpe8npov33replicate-prediction-ebnsmny609rma0cqxycbn5qq60_silenced.mp4
2025-07-10 15:37:20,360 - INFO - Processing text features...
2025-07-10 15:37:20,501 - INFO - Processing video features...
2025-07-10 15:37:22,049 - INFO - Extracted features - CLIP: torch.Size([72, 1024]), Sync: torch.Size([216, 768])
2025-07-10 15:37:22,049 - INFO - Generating audio with diffusion model...
2025-07-10 15:37:22,057 - INFO - Running 24 diffusion steps with CFG scale 5.0...
0it [00:00, ?it/s]
1it [00:00, 9.51it/s]
6it [00:00, 29.39it/s]
10it [00:00, 33.50it/s]
14it [00:00, 35.53it/s]
18it [00:00, 36.68it/s]
22it [00:00, 37.39it/s]
24it [00:00, 34.82it/s]
2025-07-10 15:37:22,849 - INFO - Combining audio with video...
2025-07-10 15:37:23,171 - INFO - Video and audio combined successfully
2025-07-10 15:37:23,171 - INFO - Audio generation completed successfully!
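The timestamps in the log above show where the prediction time was spent. The sketch below parses four of the logged lines (timestamps and messages only) and computes the gap between consecutive markers. The stage names are labels chosen here for illustration, not terms from the model's code.

```python
from datetime import datetime

# Four marker lines copied from the execution log (timestamp and message only).
LOG_LINES = [
    "2025-07-10 15:37:16,299 - INFO - Starting ThinkSound audio generation...",
    "2025-07-10 15:37:22,049 - INFO - Generating audio with diffusion model...",
    "2025-07-10 15:37:22,849 - INFO - Combining audio with video...",
    "2025-07-10 15:37:23,171 - INFO - Audio generation completed successfully!",
]

# Illustrative labels for the interval between each pair of consecutive markers.
STAGES = ["setup_and_feature_extraction", "diffusion", "mux_audio_video"]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' stamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def stage_durations(lines, stages):
    """Return {stage: seconds} for the gaps between consecutive log lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return {
        stage: (end - start).total_seconds()
        for stage, start, end in zip(stages, stamps, stamps[1:])
    }


durations = stage_durations(LOG_LINES, STAGES)
total = sum(durations.values())
for stage, seconds in durations.items():
    print(f"{stage}: {seconds:.3f}s")
print(f"logged total: {total:.3f}s")
```

By this reading, the 24 diffusion steps take about 0.8s, while the setup and feature extraction before them account for roughly 5.75s of the 6.87s covered by the log. That is close to the reported 7.05s prediction time; the log does not account for the rest of the 211.89s total time.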
Version Details
- Version ID: 40d08f9f569e91a5d72f6795ebed75178c185b0434699a98c07fc5f566efb2d4
- Version Created: July 9, 2025