cjwbw/unival

1.0K runs · Aug 2023 · Cog 0.8.6

Tags: audio-to-text, image-to-text, video-to-text

Performance

Typical run time: 2.3s

Cold start (first call): ~58s

Total runs: 1.0K

About

Unified Model for Image, Video, Audio and Language Tasks

Example Output

{"answer":null,"output":"https://replicate.delivery/pbxt/Yd2zn7zhfM1kcSEPox8xdt9tejoaGc8nRypYBp6yJc49cGZRA/out.png"}
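The example result above carries two keys, `answer` and `output`. For this Visual Grounding run, `answer` is null and `output` is a URL to an image. A minimal sketch of reading such a result follows; it assumes only the shape shown in the example, treats both keys as optional, and uses a hypothetical helper name (`read_result`). Which tasks populate which key is not stated on this page.

```python
import json
from typing import Optional, Tuple


def read_result(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (answer, output_url) from a JSON result shaped like the example.

    Either value may be None; the example Visual Grounding run has a null
    answer and an image URL under "output".
    """
    result = json.loads(raw)
    return result.get("answer"), result.get("output")


# The example output from this page, copied verbatim.
example = (
    '{"answer":null,"output":"https://replicate.delivery/pbxt/'
    'Yd2zn7zhfM1kcSEPox8xdt9tejoaGc8nRypYBp6yJc49cGZRA/out.png"}'
)

answer, output_url = read_result(example)
if answer is not None:
    print("Text answer:", answer)
if output_url is not None:
    print("Output file:", output_url)
```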

Performance Metrics

Prediction time: 2.32s

Total time: 57.74s

All Input Parameters

{
  "task_type": "Visual Grounding",
  "input_image": "https://replicate.delivery/pbxt/JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png",
  "instruction": "detached banana"
}

Input Parameters

task_type (default: Image Captioning): Choose a task.
input_audio (string): Input audio.
input_image (string): Input image.
input_video (string): Input video.
instruction (string): Provide the question for the VQA task, the region description for the Visual Grounding task, or the instruction for General tasks. For Captioning tasks, the default instruction is ‘What does the image/video/audio describe?’
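The parameters above can be assembled into an input payload along these lines. This is a sketch, not the model's official client code. The helper names `build_input` and `run_unival` are hypothetical. The model reference is put together from the owner/name and Version ID shown on this page. The call itself uses `replicate.run` from the Replicate Python client, which needs the `replicate` package and a `REPLICATE_API_TOKEN` environment variable. Only the two `task_type` values that appear on this page are used, because the exact spelling of the others is not listed here.

```python
from typing import Optional

# Assumption: "owner/name:version" built from the identifiers on this page.
MODEL_REF = (
    "cjwbw/unival:"
    "00a9af2b0889db2a73f5002b3a3000ba4eec8c5fed0cb4fd842a3fc75ab0e98f"
)


def build_input(
    task_type: str = "Image Captioning",
    instruction: Optional[str] = None,
    input_image: Optional[str] = None,
    input_video: Optional[str] = None,
    input_audio: Optional[str] = None,
) -> dict:
    """Collect the documented parameters, omitting any that were not supplied."""
    candidates = {
        "task_type": task_type,
        "instruction": instruction,
        "input_image": input_image,
        "input_video": input_video,
        "input_audio": input_audio,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def run_unival(payload: dict):
    """Send the payload to Replicate and return whatever the model produces."""
    # Imported lazily so build_input can be used without the client installed.
    import replicate

    return replicate.run(MODEL_REF, input=payload)


# Reproduces the example input listed under "All Input Parameters".
payload = build_input(
    task_type="Visual Grounding",
    input_image=(
        "https://replicate.delivery/pbxt/"
        "JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png"
    ),
    instruction="detached banana",
)
```

Calling `run_unival(payload)` would then submit the request. Omitting `instruction` for a captioning task leaves the model to apply its documented default instruction.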

Version Details

Version ID: 00a9af2b0889db2a73f5002b3a3000ba4eec8c5fed0cb4fd842a3fc75ab0e98f
Version Created: August 11, 2023
