lucataco/interactiveomni-8b
Hold multi-turn, multimodal conversations grounded in images, audio, video, and text, returning answers as text and opti...
Found 150 models (showing 141-150)
Extract text and tables from document images or PDFs. Accepts an image or a selected PDF page and returns structured tex...
Answer questions about images with grounded visual references. Takes an image and a natural-language prompt and returns...
Generate and reason over text and images from prompts or multi-turn chat, with configurable reasoning effort. Accepts te...
Generate descriptive prompts for text-to-image models from a single image. Outputs a CLIP Interrogator-style prompt stri...
Caption images. Takes an image as input and outputs a short natural-language description (image-to-text) using OpenCLIP...
Generate and reason over text from prompts, with optional image, audio, and video inputs. Produce answers, explanations,...
Caption images and answer visual questions from an input image and text prompt. Accept an image plus a question or instr...
Generate and reason over text and images for coding, professional knowledge work, and agentic workflows. Accepts a singl...
Caption images, videos, and audio; answer media-grounded questions; and localize referred objects via visual grounding....