visual-understanding AI Models - Page 4

paragekbote/gemma3-torchao-quant-sparse

Generate text and analyze images from a text prompt (optionally with an image), returning text for conversation, caption...

🖼️ → 📝 • text-generation • image-to-text • image-captioning • 54 runs

lucataco/bakllava

Caption images and answer questions about images. Takes an image and a text prompt as input and returns text, enabling i...

🖼️ → 📝 • image-to-text • visual-question-answering • visual-understanding • 39.8K runs

google/gemini-3-pro

Generates text responses from prompts with advanced reasoning capabilities, supporting multimodal inputs including image...

🖼️ → 📝 • text-generation • image-to-text • code-generation • 1.2M runs

google/gemini-2.5-flash

Generate text responses from text, image, video, and audio inputs with controllable reasoning depth. Supports up to 1 mi...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 6.8M runs

cjwbw/internlm-xcomposer

Answer questions and caption images from a text prompt and an optional image, returning text. Generate long-form text an...

🖼️ • image-captioning • image-analysis • visual-understanding • 164.4K runs

hayooucom/vision-model

Analyze images and generate detailed textual descriptions based on visual content. Supports input via image URLs or base...

📝 → 📝 • image-analysis • image-captioning • visual-understanding • 15.4K runs

openai/gpt-5.2

Generate text responses with advanced reasoning capabilities for professional knowledge work, coding, and agentic tasks....

🖼️ → 📝 • text-generation • code-generation • image-to-text • 811.9K runs

microsoft/phi-4-multimodal-instruct

Generate text responses from text, image, and audio inputs. Perform image captioning and visual question answering, OCR,...

📝 → 📝 • text-generation • speech-to-text • image-captioning • 13.2K runs

seeghost1019/google-medgemma-4b-non-clinical

Generates text responses to medical questions and analyzes medical images for research and educational purposes. Based o...

🖼️ → 📝 • text-generation • image-to-text • medical • 94 runs

openai/gpt-5.4

Generate text from prompts with configurable reasoning effort and verbosity for complex professional work, coding, and m...

🖼️ → 📝 • text-generation • code-generation • image-to-text • 121.5K runs

openai/gpt-5.1

Generate text responses from prompts or conversations with configurable reasoning effort and verbosity. Designed specifi...

🖼️ → 📝 • text-generation • image-to-text • code-generation • 244.0K runs

nelsonjchen/minigpt-4_vicuna-13b

Answer questions about images and generate detailed image captions using MiniGPT-4 with Vicuna-13B language model. Takes...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 52.0K runs

anthropic/claude-opus-4.6

Generate text and analyze images with Anthropic's most advanced language model, featuring state-of-the-art coding, reaso...

🖼️ → 📝 • text-generation • image-to-text • code-generation • 167.9K runs

anthropic/claude-opus-4.7

Generate text responses with advanced reasoning and visual understanding capabilities from text prompts and optional ima...

🖼️ → 📝 • text-generation • code-generation • image-to-text • 90.2K runs

google/gemini-3.1-pro

Advanced multimodal language model that processes text, images, videos, and audio to generate text responses. Features t...

🖼️ → 📝 • text-generation • image-to-text • code-generation • 574.6K runs

adirik/vila-7b

Answers questions about images using natural language. Takes an image and text prompt as input and generates contextual...

🖼️ → 📝 • image-to-text • visual-understanding • question-answering • 7.4K runs

naklecha/cogvlm

Analyzes images and responds to text prompts about visual content. Takes an image and a text prompt as input, then gener...

🖼️ → 📝 • image-to-text • visual-understanding • image-analysis • 12.5K runs

chenxwh/deepseek-vl2

Analyze images and answer questions about visual content using a Mixture-of-Experts vision-language model. Processes sin...

🖼️ → 📝 • image-to-text • ocr • visual-understanding • 1.1K runs

deepseek-ai/janus-pro-7b

Answers questions about images through multimodal understanding. Takes an image and a text question as input and generat...

🖼️ → 📝 • image-to-text • text-generation • visual-understanding • 13.9K runs

prunaai/gemma-4-26b-a4b-fast

Generates text responses from text, image, and video inputs using a multimodal reasoning model. Processes questions abou...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 14.1K runs