
chenxwh/deepseek-vl2
Answer questions and caption images from one to three input images, returning text. Handle visual question answering (VQ...
Found 49 models (showing 41-49)
Generate captions and answer visual questions for images and videos from a text prompt. Accepts a single image or a vide...
Caption images, detect objects, and extract text from an input image, returning text outputs. Accepts an image plus a ta...
Answer questions about images from an image and text prompt, returning text. Perform visual question answering (VQA), im...
Analyze images to generate captions, detect objects with bounding boxes, and extract text (OCR). Accepts an image plus a...
Answer questions about images and extract information, returning text. Accepts an image plus a text prompt and outputs t...
Answer questions about images and caption them from an image plus text prompt, returning text. Perform visual recognitio...
Parse UI screenshots into structured UI elements for agentic automation. Accepts an image of a desktop, mobile, or web i...
Extract structured receipt data from an image and return JSON. Takes a receipt photo or scan as input and outputs a JSON...