jd7h/edit-video-by-editing-text
Edit videos by editing the transcript. Input a video and either transcribe it to text, or supply a desired transcript to...
Found 36 models (showing 21-36)
Edit videos by editing the transcript. Input a video and either transcribe it to text, or supply a desired transcript to...
Generate text responses from text, image, video, and audio inputs with controllable reasoning depth. Supports up to 1 mi...
Analyze images or video and generate text captions, answers, and summaries. Accepts single or multiple images or a video...
Extract structured data, answer visual questions, and summarize videos from images and videos. Accepts 1–4 images or a v...
Caption images and videos and answer visual questions. Accepts an optional image or video plus a text prompt and returns...
Transcribe speech from online videos into timestamped text. Accepts a video URL (YouTube and other supported sites) and...
Transcribe audio or video to text. Accepts an audio or video input and returns a JSON transcript or ASS subtitles, lever...
Transcribe or translate speech from audio files and videos to text. Accept audio or video input and return a transcript...
Processes multiple inputs simultaneously including images, audio, text, and video to generate coherent text and speech r...
Generates text responses from prompts with advanced reasoning capabilities, supporting multimodal inputs including image...
Caption images, videos, and audio; answer media-grounded questions; and localize referred objects via visual grounding....
Generate and analyze text from prompts, images, audio, and video with fast, low-latency responses. Accept text plus up t...
Advanced multimodal language model that processes text, images, videos, and audio to generate text responses. Features t...
Generates text responses from text, image, and video inputs using a multimodal reasoning model. Processes questions abou...
Generates text responses based on text, image, or video inputs with strong multimodal reasoning capabilities. Handles vi...
Generates text responses from text, image, and video inputs using a 35B-parameter multimodal reasoning model optimized b...