cjwbw/unival

996 runs | Aug 2023 | Cog 0.8.6 | GitHub | Paper | License
audio-to-text image-to-text video-to-text

About

Unified Model for Image, Video, Audio and Language Tasks

Example Output

Output

{"answer":null,"output":"https://replicate.delivery/pbxt/Yd2zn7zhfM1kcSEPox8xdt9tejoaGc8nRypYBp6yJc49cGZRA/out.png"}

Performance Metrics

2.32s Prediction Time
57.74s Total Time
All Input Parameters
{
  "task_type": "Visual Grounding",
  "input_image": "https://replicate.delivery/pbxt/JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png",
  "instruction": "detached banana"
}
Input Parameters
task_type Default: Image Captioning
Choose a task.
input_audio Type: string
Input audio.
input_image Type: string
Input image.
input_video Type: string
Input video.
instruction Type: string
Provide a question for the VQA task, a region description for the Visual Grounding task, or an instruction for General tasks. The default instruction for the Captioning task is 'What does the image/video/audio describe?'
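
As a minimal sketch of how these parameters fit together, the helper below assembles an input payload and leaves out any unset fields. The names `build_input` and `run_unival` are hypothetical. The actual call assumes the `replicate` Python client is installed and a `REPLICATE_API_TOKEN` is set in the environment. The model reference is formed from the model name and the version ID listed on this page.

```python
from typing import Any, Dict, Optional

# Model reference assembled from this page's model name and version ID.
MODEL_REF = (
    "cjwbw/unival:"
    "00a9af2b0889db2a73f5002b3a3000ba4eec8c5fed0cb4fd842a3fc75ab0e98f"
)


def build_input(
    task_type: str = "Image Captioning",
    instruction: Optional[str] = None,
    input_image: Optional[str] = None,
    input_video: Optional[str] = None,
    input_audio: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect the documented parameters, dropping unset ones."""
    candidate = {
        "task_type": task_type,
        "instruction": instruction,
        "input_image": input_image,
        "input_video": input_video,
        "input_audio": input_audio,
    }
    return {key: value for key, value in candidate.items() if value is not None}


def run_unival(payload: Dict[str, Any]) -> Any:
    """Send the payload to Replicate (assumes the `replicate` package and an API token)."""
    import replicate  # imported lazily so build_input works without the client installed

    return replicate.run(MODEL_REF, input=payload)


# Reproduces the "All Input Parameters" example from this page.
payload = build_input(
    task_type="Visual Grounding",
    input_image="https://replicate.delivery/pbxt/JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png",
    instruction="detached banana",
)
print(payload)
```

Calling `run_unival(payload)` would then submit the prediction. Leaving `instruction` unset for a captioning task relies on the default instruction described above.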
Output Schema
Version Details
Version ID
00a9af2b0889db2a73f5002b3a3000ba4eec8c5fed0cb4fd842a3fc75ab0e98f
Version Created
August 11, 2023