jimothyjohn/phi3-vision-instruct 🖼️📝✓🔢 → 📝
About
A soon-to-be accelerated endpoint for multi-modal inference.

Example Output
Prompt:
"How many people are in this image/"
Output:
There are two people in the image.
Performance Metrics
- Prediction Time: 1.87s
- Total Time: 170.53s
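The gap between the two figures above is worth making explicit. The short Python sketch below computes how much of the total time was spent outside the prediction itself. The page does not say what that remaining time consists of; on hosted endpoints it is commonly queueing and model cold start, but that reading is an assumption.

```python
# Values copied from the performance metrics above.
prediction_time_s = 1.87   # time spent running the prediction
total_time_s = 170.53      # end-to-end time for the request

# Time not accounted for by the prediction itself (cause is an assumption,
# e.g. queueing or cold start; the page does not break it down).
overhead_s = round(total_time_s - prediction_time_s, 2)
overhead_share = overhead_s / total_time_s

print(f"overhead: {overhead_s}s ({overhead_share:.1%} of total)")
```

If most of the total is overhead, as it is here, later requests against an already-running instance may finish much closer to the prediction time, though this page gives no measurement to confirm that.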
All Input Parameters
{ "prompt": "How many people are in this image/", "do_sample": true, "image_urls": "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true", "temperature": 0.7, "max_new_tokens": 1000 }
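For readers who want to assemble the same payload in code, here is a minimal Python sketch. `build_input` is a hypothetical helper, not part of any published client, and its defaults simply mirror the example run rather than any documented defaults. The key name `image_urls` follows the example above; the parameter list names this field `image`, so check which one the deployed version expects.

```python
import json


def build_input(prompt, image_url, do_sample=True, temperature=0.7, max_new_tokens=1000):
    """Assemble a request payload shaped like the example run (hypothetical helper)."""
    if not prompt:
        raise ValueError("prompt is required")
    if not image_url:
        raise ValueError("an image URL or path is required")
    # Assumed sanity checks; the endpoint's real limits are not documented here.
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    if do_sample and temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    return {
        "prompt": prompt,
        "do_sample": do_sample,
        "image_urls": image_url,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }


payload = build_input(
    "How many people are in this image/",
    "https://github.com/JimothyJohn/cerebro/blob/master/data/images/zidane.jpg?raw=true",
)
print(json.dumps(payload, indent=2))
```

The helper only builds and checks a dictionary; how the payload is sent depends on the hosting platform's client, which this page does not specify.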
Input Parameters
- image (required): Path or URL of the input image.
- prompt (required): Input prompt.
- do_sample: Whether to use sampling; greedy decoding is used otherwise.
- temperature: Temperature for generation.
- max_new_tokens: Maximum number of new tokens to generate.
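To illustrate how `do_sample` and `temperature` typically interact during text generation, here is a minimal sketch in plain Python. It is not this model's decoding code; `pick_token` and the toy scores are hypothetical and exist only to show the difference between greedy decoding and temperature-scaled sampling.

```python
import math
import random


def pick_token(logits, do_sample=True, temperature=0.7, rng=None):
    """Return the index of the next token given a list of raw scores (logits)."""
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
greedy = pick_token(toy_logits, do_sample=False)
sampled = pick_token(toy_logits, do_sample=True, temperature=0.7, rng=random.Random(0))
print(greedy, sampled)
```

With `do_sample` off the result is deterministic, so `temperature` has no effect; with it on, lower temperatures make the top-scoring token more likely without guaranteeing it.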
Output Schema
Output
Example Execution Logs
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Version Details
- Version ID: ffcebb3e1c91ddc7eb66216dc604bc6cb843a419829364a8a407a99335450205
- Version Created: October 2, 2024