bytedance/sa2va-8b-image 🖼️📝 → ❓
About
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Example Output
{"img":"https://replicate.delivery/xezq/94rYVsLtK0IfLiKPefc5zxeeXhFcGf0acrzUNY6uVSfVLa4IKA/output.png","response":"Sure, [SEG] .<|im_end|>"}
Performance Metrics
- Prediction Time: 0.87s
- Total Time: 0.88s
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/MXdtc5yJDPoUGs6li6sYevHiNXWJjaD9O4kvCwYIAIWTHWsG/replicate-prediction-1spvj2jc8hrm80cn5f6t1xxg4m.webp", "instruction": "segment the giraffe" }
Input Parameters
- image (required): Input image for segmentation
- instruction (required): Text instruction for the model
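As a minimal sketch of how the two required inputs might be packaged into a request, the following builds (but does not send) a prediction request using only the Python standard library. The endpoint URL, the `Bearer` authorization header, and the `version`/`input` body layout are assumptions based on Replicate's public HTTP API and are not stated on this page; `build_prediction_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Version ID taken from the "Version Details" section of this page.
MODEL_VERSION = "956baf05a8a81ab47f1d0dac8eab6585b899790f342975a964840c4e9c63c7aa"

# Assumption: Replicate's general "create prediction" endpoint.
API_URL = "https://api.replicate.com/v1/predictions"


def build_prediction_request(image_url, instruction, token):
    """Build a POST request carrying both required inputs (hypothetical helper)."""
    if not image_url or not instruction:
        raise ValueError("both 'image' and 'instruction' are required")
    payload = {
        "version": MODEL_VERSION,
        "input": {"image": image_url, "instruction": instruction},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the prediction; that call is left out here so the sketch runs without network access or an API token.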
Output Schema
- img: URL of the output image
- response: Text response from the model
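The example output on this page shows the `response` string carrying a trailing `<|im_end|>` marker and a `[SEG]` placeholder. A small sketch of post-processing that output might look like the following; treating `<|im_end|>` as a strippable end-of-turn marker and `[SEG]` as a sign that a segmentation was produced are assumptions drawn from that single example, and `parse_output` is a hypothetical helper.

```python
import json

END_TOKEN = "<|im_end|>"  # assumed end-of-turn marker, as seen in the example
SEG_TOKEN = "[SEG]"       # assumed segmentation placeholder, as seen in the example


def parse_output(raw):
    """Parse the model's JSON output into a cleaned-up dict (hypothetical helper)."""
    data = json.loads(raw)
    # Drop the end marker and surrounding whitespace from the text response.
    text = data["response"].replace(END_TOKEN, "").strip()
    return {
        "image_url": data.get("img"),
        "text": text,
        "has_segmentation": SEG_TOKEN in text,
    }


example = '{"img": "https://example.com/output.png", "response": "Sure, [SEG] .<|im_end|>"}'
result = parse_output(example)
print(result["text"])
```

The `example` string uses a placeholder image URL; the `response` value is copied from the example output above.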
Example Execution Logs
propagate in video: 0%| | 0/1 [00:00<?, ?it/s]
propagate in video: 100%|██████████| 1/1 [00:00<00:00, 5599.87it/s]
Version Details
- Version ID: 956baf05a8a81ab47f1d0dac8eab6585b899790f342975a964840c4e9c63c7aa
- Version Created: February 22, 2025