adirik/bunny-phi-2-siglip 🖼️🔢📝 → 📝

▶️ 7.8K runs 📅 Feb 2024 ⚙️ Cog 0.8.3 🔗 GitHub 📄 Paper ⚖️ License
image-captioning image-to-text visual-question-answering

About

Lightweight multimodal model for visual question answering, reasoning and captioning

Example Output

Prompt:

"What is the astronaut holding in his hand?"

Output:

The astronaut is holding a green beer bottle in his hand.

Performance Metrics

3.22s Prediction Time
331.37s Total Time
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png",
  "top_p": 0.7,
  "prompt": "What is the astronaut holding in his hand?",
  "temperature": 0.2,
  "max_new_tokens": 100
}
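As a rough illustration of how the input above could be sent from code, here is a minimal sketch using the `replicate` Python client. The helper names `build_input` and `ask` are hypothetical and not part of the model or the client. The range checks mirror the limits listed under Input Parameters, and the version ID is the one shown under Version Details. The call itself assumes `pip install replicate` and a `REPLICATE_API_TOKEN` environment variable.

```python
# Model reference in "owner/name:version" form; the version ID is copied
# from the Version Details section of this page.
MODEL_REF = (
    "adirik/bunny-phi-2-siglip:"
    "b14895cb3d551c902cb163e93a6a96b4bb9482e2753daacbc6b7003c42d6644e"
)


def build_input(image, prompt, top_p=0.7, temperature=0.2, max_new_tokens=100):
    """Assemble the input dict (hypothetical helper), checking documented ranges."""
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "image": image,
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }


def ask(image, prompt, **options):
    """Run one prediction (hypothetical helper); requires network access and an API token."""
    import replicate  # imported lazily so build_input works without the client installed

    output = replicate.run(MODEL_REF, input=build_input(image, prompt, **options))
    # The output schema is a string. Joining is a precaution, assuming the
    # client may instead hand back an iterable of text pieces.
    return output if isinstance(output, str) else "".join(output)
```

Calling `ask(image_url, "What is the astronaut holding in his hand?")` with the example image URL should return the model's answer as a string. Because temperature is above 0, the wording may vary between runs.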
Input Parameters
image (required) Type: string
Input image
top_p Type: number Default: 0.7 Range: 0 - 1
Nucleus sampling threshold: during decoding, samples from the smallest set of most likely tokens whose cumulative probability reaches top_p
prompt (required) Type: string
Input text / prompt to query the image
temperature Type: number Default: 0.2 Range: 0 - 1
Adjusts randomness of outputs: higher values produce more random output, and 0 is deterministic
max_new_tokens Type: integer Default: 100 Range: 1 - ∞
Maximum number of tokens to generate. A word is generally 2-3 tokens
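To make the effect of temperature and top_p more concrete, the following is a generic sketch of temperature scaling followed by top-p (nucleus) filtering over a toy list of logits. It is not taken from this model's code. The function names `top_p_filter` and `sample` are hypothetical, and treating a temperature of 0 as greedy decoding is an assumption made for illustration.

```python
import math
import random


def top_p_filter(logits, temperature=0.2, top_p=0.7):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (keep only the top token).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise the surviving probabilities so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.2, top_p=0.7, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

With the defaults (temperature 0.2, top_p 0.7), a low temperature sharpens the distribution so that often only one or two tokens survive the filter, which is why answers at these settings tend to be nearly repeatable.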
Output Schema

Output

Type: string

Version Details
Version ID
b14895cb3d551c902cb163e93a6a96b4bb9482e2753daacbc6b7003c42d6644e
Version Created
February 26, 2024