
chenxwh/cogvlm2-video
Generate text descriptions and answers from a video input. Accepts a video and an optional prompt to perform video capti...
Answer questions and generate detailed descriptions from a video input. Provide a video and a text prompt to get caption...
Answer questions about videos and generate detailed captions from a video input. Accepts a video and a natural-language...
Generate text descriptions and answers from a video input. Accepts a video and a natural-language prompt to perform vide...
Caption videos and answer open-ended questions about their content. Accept one or more video inputs plus a list of natur...
Caption images and long videos and answer visual questions, returning text. Accepts an image or video plus an instructio...
Generate captions, summaries, and Q&A from a video input. Accepts a video and an instruction prompt and returns a single...
Analyze videos and generate text descriptions, answers, and summaries from a prompt. Accepts a video and an instruction,...
Caption videos. Provide a video and an optional instruction prompt to produce a single text output for captioning, summa...
Answer questions about images and videos. Accepts an image or a video plus a question and returns text, enabling visual...
Generate captions and answer visual questions for images and videos from a text prompt. Accepts a single image or a vide...
Transcribe speech to text from audio or video inputs. Auto-detect language or specify one, and optionally translate the...
Answer questions about video content in a multi-turn chat. Take a video and a chat message history as input and return a...
Caption images and videos. Accepts an image or video plus an optional prompt and returns text descriptions or summaries...
Generate captions, answers, and summaries from an input image or video. Accept an image or video plus an optional prompt...
Generate text descriptions for images and videos. Accepts a single image or video plus an optional instruction prompt, a...
Caption and answer questions about videos. Takes a video and a text prompt and returns text, enabling detailed descripti...
Assess video quality from a video input. Return JSON text with numeric scores for aesthetic appeal, technical quality, a...
Chat with a multimodal assistant that understands text, images, audio, and video inputs and returns text plus optional s...
Transcribe spoken words from silent video using visual speech recognition (lip reading). Input a short clip (2–40 second...