adirik/vila-7b 🖼️🔢📝 → 📝
About
[Non-commercial] A multi-image visual language model

Example Output
Prompt:
"Can you describe this image?"
Output:
The image captures a close-up view of a hand, which is the central focus. The hand is open, with the fingers spread out in a relaxed manner. The palm of the hand is facing upwards, revealing a vibrant display of colors. The colors are a mix of rainbow hues, with red, orange, yellow, green, blue, and purple being the most prominent. These colors are not just confined to the palm but are also visible on the fingers, suggesting that the hand has been dipped in a colorful substance. The background is a plain white color, which contrasts with the colorful hand and makes it stand out. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any other objects to count or compare sizes with. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. 
The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not provide any information about the actions of the objects, as it is a still image. The image does not contain any text or other discernible objects. The relative position of the hand to the background suggests that the hand is in the foreground, while the background is in the background. The image does not
Performance Metrics
- Prediction Time: 16.68s
- Total Time: 279.85s
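The gap between the two timings can be worked out directly. This is a minimal sketch, assuming (the page does not say so) that Total Time also covers queueing and model start-up on top of the prediction itself; the variable names are illustrative only.

```python
# Timings copied from the Performance Metrics above, in seconds.
prediction_time_s = 16.68
total_time_s = 279.85

# Assumption: whatever is not prediction time is queue/start-up overhead.
overhead_s = round(total_time_s - prediction_time_s, 2)
overhead_share = overhead_s / total_time_s

print(f"overhead: {overhead_s:.2f}s ({overhead_share:.0%} of total)")
```

Under that assumption, most of the wall-clock time for this run was spent outside the model's forward pass.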
All Input Parameters
{ "image": "https://replicate.delivery/pbxt/KYxJahoKC98m14tPVSk6dmrGQxT3aI54QMlfN4b9xgXlG7jM/3.jpg", "top_p": 1, "prompt": "Can you describe this image?", "num_beams": 1, "max_tokens": 512, "temperature": 0.2 }
Input Parameters
- image (required): Image 1 to discuss
- top_p: When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens
- prompt (required): Query to ask the model about the image
- num_beams: Number of beams to use when decoding text; higher values are slower but more accurate
- max_tokens: Maximum number of tokens to generate
- temperature: When decoding text, higher values make the model more creative
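The parameters above can be assembled into a request payload as follows. This is a minimal sketch, not the model author's code: `build_input` and `describe_image` are hypothetical helpers, the defaults are simply the values from the example request on this page, and the range checks are assumptions about sensible values rather than documented limits. The call itself uses `replicate.run` from the third-party `replicate` Python client, which needs a `REPLICATE_API_TOKEN` in the environment.

```python
# Model reference built from the name and Version ID listed on this page.
MODEL_REF = (
    "adirik/vila-7b:"
    "1ab820cd1a0a026c19ab0cc2fd178cdb5a588ea0b67ff1431ee84f74d6d40a3e"
)


def build_input(image, prompt, top_p=1.0, num_beams=1,
                max_tokens=512, temperature=0.2):
    """Build the input dict; defaults mirror the example request above."""
    # image and prompt are the two parameters marked (required).
    if not image or not prompt:
        raise ValueError("image and prompt are required")
    # The bounds below are assumed, not taken from the model's schema.
    if not 0 < top_p <= 1:
        raise ValueError("top_p is expected to be in (0, 1]")
    if num_beams < 1 or max_tokens < 1:
        raise ValueError("num_beams and max_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    return {
        "image": image,
        "prompt": prompt,
        "top_p": top_p,
        "num_beams": num_beams,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def describe_image(image, prompt, **options):
    """Send one request to the hosted model (requires network access)."""
    import replicate  # third-party client: pip install replicate

    return replicate.run(MODEL_REF, input=build_input(image, prompt, **options))
```

A call such as `describe_image("https://example.com/photo.jpg", "Can you describe this image?", max_tokens=128)` would send the request; the URL here is a placeholder. The shape of the returned value is not described on this page, so inspect it before relying on it. Lowering `max_tokens` is one way to cut off long, repetitive answers like the example output above.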
Output Schema
Output
Version Details
- Version ID: 1ab820cd1a0a026c19ab0cc2fd178cdb5a588ea0b67ff1431ee84f74d6d40a3e
- Version Created: March 13, 2024