zsxkib/humo 🔢🖼️❓📝 → 🖼️
About
Example Output
Prompt: "A person walking confidently down a busy street"
Performance Metrics
- Prediction Time: 1419.74s
- Total Time: 1465.44s
All Input Parameters
{
  "width": 1280,
  "height": 720,
  "prompt": "A person walking confidently down a busy street",
  "num_frames": 49,
  "guidance_scale": 5,
  "negative_prompt": "blurry, low quality, distorted, bad anatomy",
  "num_inference_steps": 50,
  "audio_guidance_scale": 5.5
}
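The same inputs could be submitted programmatically. The following is a minimal sketch, assuming the official `replicate` Python client is installed and `REPLICATE_API_TOKEN` is set in the environment. The version hash is the one listed under Version Details, and `run_example` is a hypothetical helper name. The shape of the returned value depends on the client version, so it is not shown here.

```python
import json

# Inputs copied verbatim from the example prediction on this page.
EXAMPLE_INPUT = json.loads("""
{
  "width": 1280,
  "height": 720,
  "prompt": "A person walking confidently down a busy street",
  "num_frames": 49,
  "guidance_scale": 5,
  "negative_prompt": "blurry, low quality, distorted, bad anatomy",
  "num_inference_steps": 50,
  "audio_guidance_scale": 5.5
}
""")

# Model reference in "owner/name:version" form, using the version ID from this page.
MODEL_REF = (
    "zsxkib/humo:"
    "d9b5555b1e87f11ef46b96834ecc379fabdaff97006b48564fe3d841561ab4ef"
)


def run_example():
    """Submit the example inputs and block until the prediction finishes.

    Assumption: the `replicate` package is installed and authenticated.
    It is imported lazily so the constants above can be reused without it.
    """
    import replicate

    return replicate.run(MODEL_REF, input=EXAMPLE_INPUT)


# Usage (not executed here, since it needs network access and an API token):
#   output = run_example()
```

The example prediction on this page took about 1420 seconds, so a blocking call like this may run for a long time.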
Input Parameters
- seed: Random seed for reproducible generation
- audio: Audio file for lip-sync and movement synchronization (optional)
- width: Video width in pixels
- height: Video height in pixels
- prompt: Text description of the video. Be detailed about the person, actions, and scene.
- num_frames: Number of frames (25 fps, so 25 frames = 1 second). Model trained on up to 97 frames.
- guidance_scale: Text guidance strength. Research default is 5.0. Lower values (3-5) often produce more natural lighting.
- negative_prompt: What to avoid in the video
- reference_image: Reference image to control the person's appearance (optional)
- num_inference_steps: Denoising steps. More steps = higher quality but slower. Research default is 50.
- audio_guidance_scale: Audio guidance strength (when audio provided). Higher = better sync. Research default is 5.5.
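The relationship between num_frames and clip length can be worked out from the stated 25 fps. The helpers below are a small sketch of that arithmetic. The function names are hypothetical, and treating 97 frames as a threshold worth flagging is an assumption based on the "trained on up to 97 frames" note, not a documented hard limit.

```python
FPS = 25                 # frame rate stated in the num_frames description
MAX_TRAINED_FRAMES = 97  # "Model trained on up to 97 frames"


def duration_seconds(num_frames: int) -> float:
    """Clip length in seconds for a given frame count at 25 fps."""
    return num_frames / FPS


def frames_for_duration(seconds: float) -> int:
    """Nearest frame count for a target clip length (at least 1 frame)."""
    return max(1, round(seconds * FPS))


def exceeds_training_range(num_frames: int) -> bool:
    """True when the request goes past the frame count the model was trained on."""
    return num_frames > MAX_TRAINED_FRAMES


print(duration_seconds(49))         # the example prediction: just under 2 seconds
print(frames_for_duration(2.0))     # frames needed for a 2-second clip
print(exceeds_training_range(120))  # a request beyond the trained range
```

With the example's 49 frames this gives 1.96 seconds, which matches the "2.0s video" reported in the execution logs after rounding.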
Example Execution Logs
🎬 Generating 2.0s video (1280x720, 49 frames)
📝 Mode: T (engine=TA) | Steps: 50 | Seed: 25775
🎥 Generation complete!
🎬 Saved video
✅ Success: 2.0s video at 1280x720
Version Details
- Version ID: d9b5555b1e87f11ef46b96834ecc379fabdaff97006b48564fe3d841561ab4ef
- Version Created: September 18, 2025