adirik/vila-7b 🖼️🔢📝 → 📝

▶️ 2.2K runs 📅 Mar 2024 ⚙️ Cog 0.8.3
image-captioning image-to-text visual-question-answering

About

[Non-commercial] A multi-image visual language model

Example Output

Prompt:

"Can you describe this image?"

Output

The image captures a close-up view of a hand, which is the central focus. The hand is open, with the fingers spread out in a relaxed manner. The palm of the hand is facing upwards, revealing a vibrant display of colors. The colors are a mix of rainbow hues, with red, orange, yellow, green, blue, and purple being the most prominent. These colors are not just confined to the palm but are also visible on the fingers, suggesting that the hand has been dipped in a colorful substance. The background is a plain white color, which contrasts with the colorful hand and makes it stand out. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any other objects to count or compare sizes with. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. 
The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not

Performance Metrics

16.68s Prediction Time
279.85s Total Time
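The two timings above differ by a wide margin. A minimal sketch of the arithmetic follows; it assumes the gap between total and prediction time is overhead outside the model run itself (for example queueing or startup), which this page does not state explicitly. The variable names are illustrative.

```python
# Timings copied from the Performance Metrics section of this page.
prediction_time_s = 16.68
total_time_s = 279.85

# Assumption: whatever is not prediction time is overhead (queueing, startup).
overhead_s = round(total_time_s - prediction_time_s, 2)

# Fraction of the total wall-clock time spent on the prediction itself.
prediction_share = prediction_time_s / total_time_s

print(f"overhead: {overhead_s:.2f}s")
print(f"prediction share of total: {prediction_share:.1%}")
```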
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KYxJahoKC98m14tPVSk6dmrGQxT3aI54QMlfN4b9xgXlG7jM/3.jpg",
  "top_p": 1,
  "prompt": "Can you describe this image?",
  "num_beams": 1,
  "max_tokens": 512,
  "temperature": 0.2
}
Input Parameters
image (required) Type: string
Image 1 to discuss
top_p Type: number, Default: 1, Range: 0 - 1
When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens
prompt (required) Type: string
Query to ask the model about the image
num_beams Type: integer, Default: 1, Range: 1 - 5
Number of beams to use when decoding text; higher values are slower but more accurate
max_tokens Type: integer, Default: 512, Range: 1 - ∞
Maximum number of tokens to generate
temperature Type: number, Default: 0.2, Range: 0 - ∞
When decoding text, higher values make the model more creative
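The parameters listed above can be checked before a request is sent. The following is a minimal sketch, not the model author's code: `build_input` and `describe_image` are hypothetical helper names, the defaults and ranges are copied from the parameter list, and the call assumes the third-party `replicate` Python client is installed and `REPLICATE_API_TOKEN` is set in the environment. The model reference combines the model name with the version ID given on this page.

```python
# Model name plus the version ID listed on this page.
MODEL_REF = (
    "adirik/vila-7b:"
    "1ab820cd1a0a026c19ab0cc2fd178cdb5a588ea0b67ff1431ee84f74d6d40a3e"
)


def build_input(image, prompt, top_p=1.0, num_beams=1,
                max_tokens=512, temperature=0.2):
    """Hypothetical helper: validate values against the documented ranges
    and return the input payload as a dict."""
    if not image or not prompt:
        raise ValueError("image and prompt are both required")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if not 1 <= num_beams <= 5:
        raise ValueError("num_beams must be between 1 and 5")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        "image": image,
        "prompt": prompt,
        "top_p": top_p,
        "num_beams": num_beams,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def describe_image(image, prompt, **options):
    """Hypothetical helper: send one prediction through the Replicate client.

    The import is kept local so that build_input can be used without the
    client installed.
    """
    import replicate  # third-party: pip install replicate

    return replicate.run(MODEL_REF, input=build_input(image, prompt, **options))
```

Calling `describe_image(image_url, "Can you describe this image?")` with the defaults would send the same settings as the example shown under All Input Parameters. A lower `max_tokens` may be worth trying if the output starts repeating itself, as it does in the example output on this page.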
Output Schema

Output

Type: string

Version Details
Version ID
1ab820cd1a0a026c19ab0cc2fd178cdb5a588ea0b67ff1431ee84f74d6d40a3e
Version Created
March 13, 2024