adirik/vila-7b 🖼️🔢📝 → 📝

▶️ 7.4K runs 📅 Mar 2024 ⚙️ Cog 0.8.3 🔗 GitHub 📄 Paper ⚖️ License

image-captioning image-to-text question-answering visual-question-answering visual-understanding

Performance

16.7s Typical run time

~280s Cold start (first call)

7.4K Total runs

About

[Non-commercial] A multi-image visual language model

Example Output

Prompt:

"Can you describe this image?"

Output

The image captures a close-up view of a hand, which is the central focus. The hand is open, with the fingers spread out in a relaxed manner. The palm of the hand is facing upwards, revealing a vibrant display of colors. The colors are a mix of rainbow hues, with red, orange, yellow, green, blue, and purple being the most prominent. These colors are not just confined to the palm but are also visible on the fingers, suggesting that the hand has been dipped in a colorful substance. The background is a plain white color, which contrasts with the colorful hand and makes it stand out. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any other objects to count or compare sizes with. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. 
The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not

Performance Metrics

16.68s Prediction Time

279.85s Total Time

All Input Parameters

{
  "image": "https://replicate.delivery/pbxt/KYxJahoKC98m14tPVSk6dmrGQxT3aI54QMlfN4b9xgXlG7jM/3.jpg",
  "top_p": 1,
  "prompt": "Can you describe this image?",
  "num_beams": 1,
  "max_tokens": 512,
  "temperature": 0.2
}

Input Parameters

image (required) Type: string: Image 1 to discuss
top_p Type: number, Default: 1, Range: 0 - 1: When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens
prompt (required) Type: string: Query to ask the model about the image
num_beams Type: integer, Default: 1, Range: 1 - 5: Number of beams to use when decoding text; higher values are slower but more accurate
max_tokens Type: integer, Default: 512, Range: 1 - ∞: Maximum number of tokens to generate
temperature Type: number, Default: 0.2, Range: 0 - ∞: When decoding text, higher values make the model more creative
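
The defaults and ranges listed above can be checked client-side before a request is sent. The following is a minimal sketch, assuming the parameters behave exactly as documented here; `build_input` is a hypothetical helper name and is not part of any Replicate library.

```python
from typing import Any, Dict


def build_input(
    image: str,
    prompt: str,
    top_p: float = 1.0,
    num_beams: int = 1,
    max_tokens: int = 512,
    temperature: float = 0.2,
) -> Dict[str, Any]:
    """Hypothetical helper: validate values against the documented ranges
    and return a dict shaped like the "All Input Parameters" example."""
    if not image or not prompt:
        raise ValueError("image and prompt are required")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if not 1 <= num_beams <= 5:
        raise ValueError("num_beams must be between 1 and 5")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        "image": image,
        "top_p": top_p,
        "prompt": prompt,
        "num_beams": num_beams,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
```

Calling `build_input(image_url, "Can you describe this image?")` with no other arguments yields the documented defaults, while an out-of-range value such as `num_beams=6` raises `ValueError` before any request is made.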

Output Schema

Output

Type: string

Version Details

Version ID: 1ab820cd1a0a026c19ab0cc2fd178cdb5a588ea0b67ff1431ee84f74d6d40a3e
Version Created: March 13, 2024
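
This version can be pinned when calling the model from Python. The sketch below assumes the official `replicate` client package is installed and that `REPLICATE_API_TOKEN` is set in the environment; `model_ref` and `run_vila` are illustrative names, and the input values mirror the example request on this page.

```python
MODEL = "adirik/vila-7b"
VERSION = "1ab820cd1a0a026c19ab0cc2fd178cdb5a588ea0b67ff1431ee84f74d6d40a3e"


def model_ref(model: str = MODEL, version: str = VERSION) -> str:
    """Build the "owner/name:version" reference string."""
    return f"{model}:{version}"


def run_vila(image_url: str, prompt: str) -> str:
    """Run one prediction and return the generated text."""
    # Imported lazily so this module loads even without the package installed.
    import replicate

    output = replicate.run(
        model_ref(),
        input={
            "image": image_url,
            "prompt": prompt,
            "top_p": 1,
            "num_beams": 1,
            "max_tokens": 512,
            "temperature": 0.2,
        },
    )
    # The output schema is a string. As a defensive assumption, join the
    # pieces if the client returns an iterable of string chunks instead.
    return output if isinstance(output, str) else "".join(output)
```

Given the cold start time of roughly 280s reported above, the first call may take several minutes to return.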
