video-to-text AI Models - Page 2

jd7h/edit-video-by-editing-text

Edit videos by editing the transcript. Input a video and either transcribe it to text, or supply a desired transcript to...

🎥 • video-editing • video-to-text • 814 runs

google/gemini-2.5-flash

Generate text responses from text, image, video, and audio inputs with controllable reasoning depth. Supports up to 1 mi...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 6.8M runs

lucataco/internvl3_5-30b

Analyze images or video and generate text captions, answers, and summaries. Accepts single or multiple images or a video...

🖼️ → 📝 • image-to-text • video-to-text • 63 runs

nvidia/nemotron-nano-v2-12b-vl

Analyzes images and videos to answer questions, extract data, and provide detailed descriptions. Supports processing up...

🖼️ → 📝 • image-to-text • video-to-text • document-to-json • 988 runs

lucataco/qwen3-vl-8b-instruct

Analyze images and videos to generate detailed text descriptions and answers to questions. Supports both image and video...

🖼️ → 📝 • image-to-text • video-to-text • ocr • 93.6K runs

adidoes/whisperx-video-transcribe

Transcribe speech from online videos into timestamped text. Accepts a video URL (YouTube and other supported sites) and...

🎥 • speech-to-text • video-to-text • 19.6K runs

hovevideo/stable-whisper

Transcribe audio or video to text. Accepts an audio or video input and returns a JSON transcript or ASS subtitles, lever...

🎥 • speech-to-text • video-to-text • video-auto-captioning • 173 runs

turian/insanely-fast-whisper-with-video

Transcribe or translate speech from audio files and videos to text. Accept audio or video input and return a transcript...

🎥 • speech-to-text • video-to-text • speaker-diarization • 8.6M runs

lucataco/interactiveomni-8b

Processes multiple inputs simultaneously including images, audio, text, and video to generate coherent text and speech r...

📝 → 🔊 • text-generation • image-to-text • video-to-text • 86 runs

google/gemini-3-pro

Generates text responses from prompts with advanced reasoning capabilities, supporting multimodal inputs including image...

🖼️ → 📝 • text-generation • image-to-text • code-generation • 1.2M runs

cjwbw/unival

Caption images, videos, and audio; answer media-grounded questions; and localize referred objects via visual grounding....

🖼️ → 📝 • image-to-text • video-to-text • audio-to-text • 996 runs

google/gemini-3-flash

Generates text responses from text prompts with support for multimodal inputs including images, videos, and audio. Combi...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 4.1M runs

google/gemini-3.1-pro

Advanced multimodal language model that processes text, images, videos, and audio to generate text responses. Features t...

🖼️ → 📝 • text-generation • image-to-text • code-generation • 574.6K runs

prunaai/gemma-4-26b-a4b-fast

Generates text responses from text, image, and video inputs using a multimodal reasoning model. Processes questions abou...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 14.1K runs

prunaai/qwen-3.5-27b-fast

Generates text responses based on text, image, or video inputs with strong multimodal reasoning capabilities. Handles vi...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 155 runs

prunaai/qwen-3.5-35b-a3b-fast

Generates text responses from text, image, and video inputs using a 35B-parameter multimodal reasoning model optimized b...

🖼️ → 📝 • text-generation • image-to-text • video-to-text • 57 runs

bytedance/sa2va-26b-video

Analyze videos with text instructions to provide question answering, visual understanding, and dense object segmentation...

🎥 • video-to-text • video-segmentation • video-object-detection • 763 runs