jimothyjohn/phi3-vision-instruct 🖼️📝✓🔢 → 📝

▶️ 201 runs 📅 Sep 2024 ⚙️ Cog 0.9.24 🔗 GitHub 📄 Paper ⚖️ License
image-captioning image-to-text visual-question-answering

About

A soon-to-be-accelerated endpoint for multimodal (image and text) inference with Phi-3 Vision.

Example Output

Prompt:

"How many people are in this image/"

Output

There are two people in the image.

Performance Metrics

1.87s Prediction Time
170.53s Total Time
All Input Parameters
{
  "prompt": "How many people are in this image/",
  "do_sample": true,
  "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
  "temperature": 0.7,
  "max_new_tokens": 1000
}
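
For reference, here is a minimal sketch of reproducing this prediction with the official replicate Python client (this assumes pip install replicate and a REPLICATE_API_TOKEN in the environment; note that the recorded run uses an image_urls key while the parameter listing below names the field image, so check the model's current schema before relying on either):

import replicate

# Reproduce the prediction above. replicate.run() resolves "owner/name" to the
# model's latest version; depending on client version you may need to append
# ":<version id>" (see Version Details below).
output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",
    input={
        "prompt": "How many people are in this image?",
        "do_sample": True,
        "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
        "temperature": 0.7,
        "max_new_tokens": 1000,
    },
)

# The output schema is a plain string, e.g. "There are two people in the image."
print(output)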
Input Parameters
image (required) Type: string
Image URL or file path
prompt (required) Type: string
Input prompt
do_sample Type: boolean, Default: true
Whether or not to use sampling; use greedy decoding otherwise (see the sketch after this list).
temperature Type: number, Default: 0.7, Range: 0 - 1
Temperature for generation
max_new_tokens Type: integer, Default: 1024, Range: 64 - 2048
Maximum number of new tokens to generate
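
If deterministic answers are preferred, the parameters above suggest turning sampling off. A hedged sketch of the same call with greedy decoding, using only parameter names documented in this listing (defaults apply otherwise):

import replicate

# Greedy decoding: with do_sample set to false the model takes the
# highest-probability token at each step, and temperature is typically ignored.
output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",
    input={
        "prompt": "How many people are in this image?",
        "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
        "do_sample": False,
        "max_new_tokens": 256,
    },
)
print(output)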
Output Schema

Output

Type: string

Example Execution Logs
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Version Details
Version ID
ffcebb3e1c91ddc7eb66216dc604bc6cb843a419829364a8a407a99335450205
Version Created
October 2, 2024
Run on Replicate →
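
To target this exact build rather than whatever version is currently latest, the version ID listed above can be passed to the predictions API. A minimal sketch, assuming the replicate Python client (predictions.create and Prediction.wait exist in recent client releases; exact helper names may vary with client version):

import replicate

# Create a prediction against the specific version documented above.
prediction = replicate.predictions.create(
    version="ffcebb3e1c91ddc7eb66216dc604bc6cb843a419829364a8a407a99335450205",
    input={
        "prompt": "How many people are in this image?",
        "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
    },
)

# Block until the run finishes, then read the string output.
prediction.wait()
print(prediction.status)  # "succeeded" on success
print(prediction.output)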