bytedance/sa2va-26b-image 🖼️📝 → ❓

▶️ 11.7K runs 📅 Feb 2025 ⚙️ Cog 0.14.2

image-segmentation image-to-text visual-grounding visual-question-answering

Performance

2.0s Typical run time

11.7K Total runs

About

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Example Output

Output

{"img":"https://replicate.delivery/yhqm/XMIPgu8PAvqhKxFjrsAzCGgZFfnlbIA6SeNtEIqM2JmxJxRUA/output.png","response":"Sure, [SEG] .<|im_end|>"}

Performance Metrics

2.00s Prediction Time

2.01s Total Time

All Input Parameters

{
  "image": "https://replicate.delivery/pbxt/MXeFEYuz0b5rNtNmOhvMkzhAfJUWEa29ywD88KamZd6aegmD/replicate-prediction-bjg6qedsznrma0cn5gftx6w40r.webp",
  "instruction": "please segment the woman dancing in a blue dress"
}

Input Parameters

image (required) Type: string: Input image for segmentation
instruction (required) Type: string: Text instruction for the model
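
A minimal sketch of how these two inputs might be sent from Python, assuming the official `replicate` client package is installed and `REPLICATE_API_TOKEN` is set in the environment. The model name and version ID are the ones listed on this page; `build_input` and `run_prediction` are hypothetical helper names introduced for illustration, not part of the model or the client.

```python
# Model identifier and version ID as listed on this page.
MODEL = "bytedance/sa2va-26b-image"
VERSION = "addd35cc4f8e0761836ff1e4af324bd7b1f4fa67ee3d384b69202cb288a7dd4f"


def build_input(image: str, instruction: str) -> dict:
    """Build the input payload (hypothetical helper).

    Both fields are required strings according to the parameter list above.
    """
    if not isinstance(image, str) or not image.strip():
        raise ValueError("image is required and must be a non-empty string")
    if not isinstance(instruction, str) or not instruction.strip():
        raise ValueError("instruction is required and must be a non-empty string")
    return {"image": image, "instruction": instruction}


def run_prediction(image: str, instruction: str):
    """Send one prediction (hypothetical helper).

    Needs network access and an API token, so the client is imported lazily.
    """
    import replicate  # assumption: installed with `pip install replicate`

    return replicate.run(f"{MODEL}:{VERSION}", input=build_input(image, instruction))
```

Calling `run_prediction(image_url, "please segment the woman dancing in a blue dress")` would be expected to return the `img` and `response` fields described under Output Schema, though the exact Python type of the returned value may depend on the client version.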

Output Schema

img Type: string, Format: uri: Img
response Type: string: Response
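
The example output on this page shows the `response` string ending in a `<|im_end|>` marker and containing a `[SEG]` token. A rough sketch of post-processing such an output follows. It assumes, without confirmation from this page, that `<|im_end|>` is an end-of-turn marker that can be safely stripped and that `[SEG]` signals a mask was produced. `parse_output` is a hypothetical helper name.

```python
END_TOKEN = "<|im_end|>"  # assumption: end-of-turn marker, safe to strip
SEG_TOKEN = "[SEG]"       # assumption: signals that a segmentation mask was produced


def parse_output(output: dict) -> dict:
    """Clean up a prediction output with `img` and `response` fields (hypothetical helper)."""
    response = output.get("response", "")
    text = response.replace(END_TOKEN, "").strip()
    return {
        "text": text,                   # response without the end marker
        "has_mask": SEG_TOKEN in text,  # True when the [SEG] token is present
        "mask_url": output.get("img"),  # URI of the output image, if any
    }


# Response string taken from the Example Output on this page;
# the image URL is a placeholder.
example = {
    "img": "https://example.com/output.png",
    "response": "Sure, [SEG] .<|im_end|>",
}
print(parse_output(example))
```

Keeping the raw `response` around alongside the cleaned text may be useful if the token conventions differ from what is assumed here.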

Example Execution Logs

propagate in video:   0%|          | 0/1 [00:00<?, ?it/s]
propagate in video: 100%|██████████| 1/1 [00:00<00:00, 5548.02it/s]

Version Details

Version ID: addd35cc4f8e0761836ff1e4af324bd7b1f4fa67ee3d384b69202cb288a7dd4f
Version Created: March 20, 2025
