adirik/bunny-phi-2-siglip 🖼️🔢📝 → 📝
About
Lightweight multimodal model for visual question answering, reasoning and captioning

Example Output
Prompt:
"What is the astronaut holding in his hand?"
Output:
The astronaut is holding a green beer bottle in his hand.
Performance Metrics
- Prediction Time: 3.22s
- Total Time: 331.37s
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png", "top_p": 0.7, "prompt": "What is the astronaut holding in his hand?", "temperature": 0.2, "max_new_tokens": 100 }
Input Parameters
- image (required): Input image.
- top_p: During decoding, samples only from the most likely tokens whose cumulative probability falls within top_p.
- prompt (required): Input text or question used to query the image.
- temperature: Adjusts the randomness of the output; 0 is deterministic, and values greater than 1 are increasingly random.
- max_new_tokens: Maximum number of tokens to generate. A word is generally 2-3 tokens.
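The parameters above can be assembled into a request payload roughly as follows. This is a minimal sketch, not the model's own client code: `build_input` is a hypothetical helper, its default values are copied from the example request on this page rather than from documented defaults, and the range checks are assumptions about sensible values, not limits stated by the model.

```python
from typing import Any, Dict


def build_input(
    image: str,
    prompt: str,
    top_p: float = 0.7,          # assumed default, taken from the example request
    temperature: float = 0.2,    # assumed default, taken from the example request
    max_new_tokens: int = 100,   # assumed default, taken from the example request
) -> Dict[str, Any]:
    """Build the input dictionary for the model (hypothetical helper)."""
    # image and prompt are the two required parameters.
    if not image:
        raise ValueError("image is required")
    if not prompt:
        raise ValueError("prompt is required")
    # The checks below are assumptions, not documented constraints.
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p is expected to lie in (0, 1]")
    if temperature < 0.0:
        raise ValueError("temperature is expected to be non-negative")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens is expected to be at least 1")
    return {
        "image": image,
        "top_p": top_p,
        "prompt": prompt,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }
```

Calling `build_input(image_url, "What is the astronaut holding in his hand?")` with the example image URL yields the same dictionary shown under All Input Parameters.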
Output Schema
Output
Version Details
- Version ID: b14895cb3d551c902cb163e93a6a96b4bb9482e2753daacbc6b7003c42d6644e
- Version Created: February 26, 2024
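Pinning the version ID keeps results reproducible across model updates. The sketch below assumes the model is called through the third-party `replicate` Python client (`pip install replicate`) with a `REPLICATE_API_TOKEN` environment variable set; `model_ref` and `run_example` are hypothetical helper names, and the handling of the return value is an assumption, since the output may arrive as a single string or as a sequence of text chunks.

```python
MODEL = "adirik/bunny-phi-2-siglip"
VERSION = "b14895cb3d551c902cb163e93a6a96b4bb9482e2753daacbc6b7003c42d6644e"


def model_ref(model: str = MODEL, version: str = VERSION) -> str:
    """Return the 'owner/name:version' reference for a pinned model version."""
    return f"{model}:{version}"


def run_example() -> str:
    """Run the example request from this page (requires network and an API token)."""
    # Imported lazily so the module loads without the client installed.
    import replicate

    output = replicate.run(
        model_ref(),
        input={
            "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png",
            "top_p": 0.7,
            "prompt": "What is the astronaut holding in his hand?",
            "temperature": 0.2,
            "max_new_tokens": 100,
        },
    )
    # Assumption: the result is either a string or an iterable of text chunks.
    if isinstance(output, str):
        return output
    return "".join(str(chunk) for chunk in output)
```

`run_example()` is not invoked automatically; call it only where the client and token are available.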