adirik/bunny-phi-2-siglip 🖼️🔢📝 → 📝

▶️ 7.9K runs 📅 Feb 2024 ⚙️ Cog 0.8.3

image-captioning image-to-text visual-question-answering visual-understanding

About

Lightweight multimodal model for visual question answering, reasoning and captioning

Example Output

Prompt:

"What is the astronaut holding in his hand?"

Output:

The astronaut is holding a green beer bottle in his hand.

Performance Metrics

3.22s Prediction Time

331.37s Total Time

All Input Parameters

{
  "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png",
  "top_p": 0.7,
  "prompt": "What is the astronaut holding in his hand?",
  "temperature": 0.2,
  "max_new_tokens": 100
}
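
The parameters above can be sent through Replicate's Python client. The following is a minimal sketch, not the author's own code: it assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment. `build_input` and `ask` are hypothetical helper names introduced for illustration, and the version hash is the one listed under Version Details.

```python
MODEL = "adirik/bunny-phi-2-siglip"
VERSION = "b14895cb3d551c902cb163e93a6a96b4bb9482e2753daacbc6b7003c42d6644e"


def build_input(image, prompt, top_p=0.7, temperature=0.2, max_new_tokens=100):
    """Assemble the input dict, falling back to the documented defaults."""
    return {
        "image": image,  # URL of the input image
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }


def ask(image, prompt, **overrides):
    """Run one prediction and return the answer text.

    Needs network access and a valid API token.
    """
    # Imported lazily so build_input stays usable without the client installed.
    import replicate

    output = replicate.run(
        f"{MODEL}:{VERSION}",
        input=build_input(image, prompt, **overrides),
    )
    # The output schema is a string; join defensively in case chunks are returned.
    return output if isinstance(output, str) else "".join(output)
```

Calling `ask(image_url, "What is the astronaut holding in his hand?")` with the example image URL would reproduce the request shown above. Because sampling is enabled (temperature 0.2), the wording of the answer may vary between runs.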

Input Parameters

image (required) Type: string. Input image.
top_p Type: number. Default: 0.7. Range: 0 - 1. Restricts sampling to the most likely tokens whose cumulative probability reaches top_p.
prompt (required) Type: string. Input text / prompt to query the image.
temperature Type: number. Default: 0.2. Range: 0 - 1. Adjusts randomness of outputs: 0 is deterministic, and higher values within the range are more random.
max_new_tokens Type: integer. Default: 100. Range: 1 - ∞. Maximum number of tokens to generate. A word is generally 2-3 tokens.
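
The documented ranges can be checked on the client before a request is sent. This is a rough sketch under the assumption that out-of-range values should be rejected locally; `validate_params` is a hypothetical helper, not part of the model or of Replicate's API.

```python
def validate_params(top_p=0.7, temperature=0.2, max_new_tokens=100):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    if not 0 <= top_p <= 1:
        problems.append(f"top_p must be between 0 and 1, got {top_p}")
    if not 0 <= temperature <= 1:
        problems.append(f"temperature must be between 0 and 1, got {temperature}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int = isinstance(max_new_tokens, int) and not isinstance(max_new_tokens, bool)
    if not is_int or max_new_tokens < 1:
        problems.append(
            f"max_new_tokens must be an integer >= 1, got {max_new_tokens!r}"
        )
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every invalid value at once.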

Output Schema

Output

Type: string

Version Details

Version ID: b14895cb3d551c902cb163e93a6a96b4bb9482e2753daacbc6b7003c42d6644e
Version Created: February 26, 2024
