bytedance/sa2va-4b-image 🖼️📝 → ❓
About
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Example Output
Output
Performance Metrics
1.07s
Prediction Time
69.27s
Total Time
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/MXVMYKMbmDEtKqsegGtgTgmZQAhDRXydmVnt0tRA65Cr8L3H/replicate-prediction-vc34d0cgt9rme0cn57f8qqzp8m.webp", "instruction": "segment the snowboarder" }
Input Parameters
- image (required)
- Input image for segmentation
- instruction (required)
- Text instruction for the model
Output Schema
- img
- Img
- response
- Response
Example Execution Logs
propagate in video: 0%| | 0/1 [00:00<?, ?it/s] propagate in video: 100%|██████████| 1/1 [00:00<00:00, 5440.08it/s]
Version Details
- Version ID
ccf8e58352ce5b2a6cbf5cb68b947ca3bc62009ce66e82071eeae14555414411
- Version Created
- February 22, 2025