adirik/vila-2.7b 🖼️🔢📝 → 📝

▶️ 2.6K runs 📅 Mar 2024 ⚙️ Cog 0.8.3
image-analysis image-captioning image-to-text visual-question-answering

About

[Non-commercial] A multi-image visual language model

Example Output

Prompt:

"Can you describe this image?"

Output

The image captures a moment of creativity and playfulness. A person's hand, adorned with a rainbow of colors, is the central focus. The hand is painted with a vibrant array of colors, including hues of red, orange, yellow, green, blue, and purple. The colors are not just confined to the hand, but also extend to the person's wrist, creating a striking contrast against the white background. The image is taken from a low angle, adding a sense of grandeur to the hand and its colors. The overall composition of the image suggests a celebration of individuality and the joy of self-expression through art.

Performance Metrics

5.64s Prediction Time
85.54s Total Time
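
The two figures above differ by a wide margin. As a small worked example, the snippet below computes how much of the total time fell outside the prediction itself. The page does not say what that remainder consists of, so attributing it to queueing or model start-up would be an assumption.

```python
# Values copied from the Performance Metrics listing on this page.
prediction_time_s = 5.64
total_time_s = 85.54

# Time spent outside the prediction itself (cause not stated on the page).
overhead_s = round(total_time_s - prediction_time_s, 2)

# Fraction of the total wall-clock time spent generating the output.
prediction_share = prediction_time_s / total_time_s

print(f"{overhead_s:.2f}s outside prediction; "
      f"{prediction_share:.1%} of total time spent predicting")
```

For this run, the remainder is 79.90s, so only a small fraction of the total was spent generating text.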
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KYxCJBjNTIOz189qT2R55a8otIFVM1igj8jiDcO8qMx39WXV/3.jpg",
  "top_p": 1,
  "prompt": "Can you describe this image?",
  "num_beams": 1,
  "max_tokens": 512,
  "temperature": 0.2
}
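
The input above can be reproduced programmatically. What follows is a minimal sketch, assuming the third-party `replicate` Python package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The helper names `build_input` and `describe_image` are hypothetical and do not come from this page; the version hash is the one listed under Version Details.

```python
# Model reference in "owner/name:version" form, using the version ID from this page.
MODEL_REF = (
    "adirik/vila-2.7b:"
    "f8a85ee8d0ab7c84433c5f6c00855142edf4783b52a1b66626d47ef260028750"
)


def build_input(image, prompt, top_p=1, num_beams=1, max_tokens=512, temperature=0.2):
    """Assemble the input dict; keyword defaults mirror the documented defaults."""
    return {
        "image": image,
        "top_p": top_p,
        "prompt": prompt,
        "num_beams": num_beams,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def describe_image(image, prompt):
    """Hypothetical helper: run one prediction and return the model's output."""
    # Imported lazily so build_input() can be used without the client installed.
    import replicate  # third-party package; needs network access and an API token

    return replicate.run(MODEL_REF, input=build_input(image, prompt))
```

Calling `describe_image` with the image URL and prompt shown above should return a single string, in line with the Output Schema, though the exact text will vary between runs.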
Input Parameters
image (required) Type: string
Image 1 to discuss
top_p Type: number, Default: 1, Range: 0 - 1
When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens
prompt (required) Type: string
Query to ask the model about the image
num_beams Type: integer, Default: 1, Range: 1 - 5
Number of beams to use when decoding text; higher values are slower but more accurate
max_tokens Type: integer, Default: 512, Range: 1 - ∞
Maximum number of tokens to generate
temperature Type: number, Default: 0.2, Range: 0 - ∞
When decoding text, higher values make the model more creative
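
The types, defaults, and ranges listed above can be checked on the client side before a request is sent. The sketch below is one way to do that; `validate_input` is a hypothetical helper, not part of the model or of any Replicate library, and treating an empty string as invalid for the required fields is an assumption.

```python
def validate_input(params):
    """Check a parameter dict against the documented types and ranges.

    Returns a new dict with documented defaults filled in.
    Raises ValueError if a value is missing or out of range.
    """
    defaults = {"top_p": 1, "num_beams": 1, "max_tokens": 512, "temperature": 0.2}
    merged = {**defaults, **params}

    # image and prompt are the two required string fields.
    for name in ("image", "prompt"):
        value = merged.get(name)
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} is required and must be a non-empty string")

    if not 0 <= merged["top_p"] <= 1:
        raise ValueError("top_p must be in the range 0 - 1")

    beams = merged["num_beams"]
    if not isinstance(beams, int) or not 1 <= beams <= 5:
        raise ValueError("num_beams must be an integer in the range 1 - 5")

    tokens = merged["max_tokens"]
    if not isinstance(tokens, int) or tokens < 1:
        raise ValueError("max_tokens must be an integer of at least 1")

    if merged["temperature"] < 0:
        raise ValueError("temperature must be 0 or greater")

    return merged
```

Validating locally avoids spending a request on a value the model would reject, such as `num_beams` set to 6.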
Output Schema

Output

Type: string

Version Details
Version ID
f8a85ee8d0ab7c84433c5f6c00855142edf4783b52a1b66626d47ef260028750
Version Created
March 13, 2024