pku-yuangroup/llava-cot
Answer questions about images with step-by-step reasoning. Take an image and an optional text prompt and output text, in...
Found 150 models (showing 121-140)
Answer questions about images from an image and text prompt, returning text responses. Perform visual question answering...
Answer questions about images and generate captions from an image and a text prompt, returning text. Perform visual ques...
Extract text from images and documents in 90+ languages with OCR, returning plain text plus optional structured layout....
Generate text from prompts or chat and analyze images to produce captions and grounded answers. Accepts text and optiona...
Generate descriptive text captions from three input images using arithmetic operations on image features. The model com...
Classify images into categories. Accepts a single image and returns top predicted classes with probabilities using a Res...
Extract text and document structure from an input image into Markdown or plain text. Handle PDFs, scans, screenshots, re...
Classify the nationality of a national ID card from an image. Accepts a photo or scan of an ID document and returns the...
Generate image captions from a single image input. Select from COCO or Conceptual Captions modes and optionally use beam...
Caption images. Accepts an image input and generates a zero-shot natural-language description, optionally conditioned by...
Generate text and multimodal analyses from text, image, and video inputs. Handle very long contexts (around 1M tokens) f...
Classify age range from an input image. Accepts a single image and predicts a discrete age-range label with a confidence...
Analyze images or video and generate text captions, answers, and summaries. Accepts single or multiple images or a video...
Extract structured data, answer visual questions, and summarize videos from images and videos. Accepts 1–4 images or a v...
Caption images and videos and answer visual questions. Accepts an optional image or video plus a text prompt and returns...
Caption images and answer visual questions from an input image, returning text. Accept an image and an optional instruct...
Moderate images for safety and policy compliance. Takes an image input (optionally a custom prompt) and outputs structur...
Caption images by generating detailed, paragraph-length natural-language descriptions from a single image input. Outputs...
Extract text and document structure from images into plain text or Markdown. Accept an image and a task type (markdown,...