Visual Question Answering AI Models

spuuntries/urna-kp3l

Caption images and answer visual questions from an image and a text prompt. Accepts an input image and an instruction (e...

🖼️ → 📝 • image-to-text • image-captioning • visual-question-answering • 108 runs

🤖 Model 🖼️ → 📝

bytedance/sa2va-26b-image

Segment objects in images from natural-language instructions and answer visual questions. Provide an image plus a text i...

🖼️ → 📝 • image-segmentation • image-to-text • 6.4K runs

bytedance/sa2va-8b-image

Analyzes images with text instructions to provide visual understanding and object segmentation. Combines SAM2 segmentati...

🖼️ → 📝 • image-to-text • image-segmentation • visual-understanding • 48.3K runs

salesforce/blip

Generate image captions, answer questions about images, or match images with text descriptions. Supports three main task...

🖼️ → 📝 • image-to-text • visual-question-answering • 173.0M runs

andreasjansson/blip-2

Answers questions about images and generates image captions. Takes an image and a text question as input, returning a te...

🖼️ → 📝 • image-to-text • visual-understanding • 31.8M runs

zsxkib/blip-3

Answers questions about images and generates image captions using BLIP-3/XGen-MM multimodal model. Takes an image and a...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 1.3M runs

yorickvp/llava-13b

Analyzes images and answers questions about them through conversational text generation. Combines visual understanding w...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 35.8M runs

lucataco/moondream2

Analyzes images and generates text descriptions based on visual content and optional prompts. This small vision language...

🖼️ → 📝 • image-to-text • visual-understanding • ocr • 13.1M runs

deepseek-ai/deepseek-vl2

Analyzes images and answers questions about visual content using a Mixture-of-Experts architecture. Takes an image and t...

🖼️ → 📝 • image-to-text • ocr • text-generation • 98.6K runs

cjwbw/cogvlm

Analyzes images and answers questions about them using a visual language model. Takes an image and a text query as input...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 1.5M runs

deepseek-ai/janus-pro-7b

Answers questions about images through multimodal understanding. Takes an image and a text question as input and generat...

🖼️ → 📝 • image-to-text • text-generation • visual-understanding • 13.9K runs

lucataco/bakllava

Caption images and answer questions about images. Takes an image and a text prompt as input and returns text, enabling i...

🖼️ → 📝 • image-to-text • visual-question-answering • visual-understanding • 39.8K runs

adirik/vila-2.7b

Analyzes images and generates text responses to questions about the visual content. Takes an image and text prompt as in...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 2.6K runs

lucataco/qvq-72b-preview

Analyzes images and answers questions about visual content with enhanced reasoning capabilities. Takes an image and text...

🖼️ → 📝 • image-to-text • visual-understanding • text-generation • 297 runs

naklecha/cogvlm

Analyzes images and responds to text prompts about visual content. Takes an image and a text prompt as input, then gener...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 12.5K runs

lucataco/idefics-8b

Accepts arbitrary sequences of image and text inputs to produce text outputs for multimodal tasks. Answers questions abo...

🖼️ → 📝 • image-to-text • visual-understanding • ocr • 1.2K runs

adirik/vila-7b

Answers questions about images using natural language. Takes an image and text prompt as input and generates contextual...

🖼️ → 📝 • image-to-text • visual-understanding • question-answering • 7.4K runs

deepseek-ai/janus-pro-1b

Analyzes images and answers questions about them using a unified autoregressive framework for multimodal understanding....

🖼️ → 📝 • image-to-text • visual-understanding • multimodal • 6.7K runs

zsxkib/idefics3

Answers questions about images and generates detailed captions based on visual content and text prompts. Processes both...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 2.7K runs

lucataco/smolvlm-instruct

Analyzes images and generates text responses based on visual content and text prompts. Accepts arbitrary sequences of im...

🖼️ → 📝 • image-to-text • visual-understanding • document-understanding • 8.3K runs