jimothyjohn/phi3-vision-instruct 🖼️📝✓🔢 → 📝

▶️ 201 runs 📅 Sep 2024 ⚙️ Cog 0.9.24 🔗 GitHub 📄 Paper ⚖️ License
image-captioning image-to-text visual-question-answering

About

A soon-to-be-accelerated endpoint for multimodal (image and text) inference with Phi-3 Vision.

Example Output

Prompt:

"How many people are in this image/"

Output

There are two people in the image.

Performance Metrics

1.87s Prediction Time
170.53s Total Time
All Input Parameters
{
  "prompt": "How many people are in this image/",
  "do_sample": true,
  "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
  "temperature": 0.7,
  "max_new_tokens": 1000
}
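
For reference, here is a minimal sketch of reproducing this prediction with the official replicate Python client (this assumes pip install replicate and a REPLICATE_API_TOKEN in the environment; note that the recorded run uses an image_urls key while the parameter listing below names the field image, so check the model's current schema before relying on either):

import replicate

# Reproduce the prediction above. replicate.run() resolves "owner/name" to the
# model's latest version; depending on client version you may need to append
# ":<version id>" (see Version Details below).
output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",
    input={
        "prompt": "How many people are in this image?",
        "do_sample": True,
        "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
        "temperature": 0.7,
        "max_new_tokens": 1000,
    },
)

# The output schema is a plain string, e.g. "There are two people in the image."
print(output)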
Input Parameters
image (required) Type: string
Image URL or file path
prompt (required) Type: string
Input prompt
do_sample Type: boolean, Default: true
Whether or not to use sampling; use greedy decoding otherwise (see the sketch after this list).
temperature Type: number, Default: 0.7, Range: 0 - 1
Temperature for generation
max_new_tokens Type: integer, Default: 1024, Range: 64 - 2048
Maximum number of new tokens to generate
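
If deterministic answers are preferred, the parameters above suggest turning sampling off. A hedged sketch of the same call with greedy decoding, using only parameter names documented in this listing (defaults apply otherwise):

import replicate

# Greedy decoding: with do_sample set to false the model takes the
# highest-probability token at each step, and temperature is typically ignored.
output = replicate.run(
    "jimothyjohn/phi3-vision-instruct",
    input={
        "prompt": "How many people are in this image?",
        "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
        "do_sample": False,
        "max_new_tokens": 256,
    },
)
print(output)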
Output Schema

Output

Type: string

Example Execution Logs
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Version Details
Version ID
ffcebb3e1c91ddc7eb66216dc604bc6cb843a419829364a8a407a99335450205
Version Created
October 2, 2024
Run on Replicate →
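
To target this exact build rather than whatever version is currently latest, the version ID listed above can be passed to the predictions API. A minimal sketch, assuming the replicate Python client (predictions.create and Prediction.wait exist in recent client releases; exact helper names may vary with client version):

import replicate

# Create a prediction against the specific version documented above.
prediction = replicate.predictions.create(
    version="ffcebb3e1c91ddc7eb66216dc604bc6cb843a419829364a8a407a99335450205",
    input={
        "prompt": "How many people are in this image?",
        "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
    },
)

# Block until the run finishes, then read the string output.
prediction.wait()
print(prediction.status)  # "succeeded" on success
print(prediction.output)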