adirik/vila-2.7b 🖼️🔢📝 → 📝

▶️ 2.6K runs 📅 Mar 2024 ⚙️ Cog 0.8.3

image-analysis image-captioning image-to-text visual-question-answering

About

[Non-commercial] A multi-image visual language model

Example Output

Prompt:

"Can you describe this image?"

Output

The image captures a moment of creativity and playfulness. A person's hand, adorned with a rainbow of colors, is the central focus. The hand is painted with a vibrant array of colors, including hues of red, orange, yellow, green, blue, and purple. The colors are not just confined to the hand, but also extend to the person's wrist, creating a striking contrast against the white background. The image is taken from a low angle, adding a sense of grandeur to the hand and its colors. The overall composition of the image suggests a celebration of individuality and the joy of self-expression through art.

Performance Metrics

5.64s Prediction Time

85.54s Total Time

All Input Parameters

{
  "image": "https://replicate.delivery/pbxt/KYxCJBjNTIOz189qT2R55a8otIFVM1igj8jiDcO8qMx39WXV/3.jpg",
  "top_p": 1,
  "prompt": "Can you describe this image?",
  "num_beams": 1,
  "max_tokens": 512,
  "temperature": 0.2
}
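
The prediction above could be reproduced from Python roughly as follows. This is a minimal sketch, assuming the official `replicate` client package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The model reference combines the model name with the Version ID listed on this page; the helper name `describe_image` is hypothetical.

```python
# Model reference: "owner/name:version", using the Version ID from this page.
MODEL_REF = (
    "adirik/vila-2.7b:"
    "f8a85ee8d0ab7c84433c5f6c00855142edf4783b52a1b66626d47ef260028750"
)

# Same inputs as the example prediction shown above.
EXAMPLE_INPUT = {
    "image": "https://replicate.delivery/pbxt/KYxCJBjNTIOz189qT2R55a8otIFVM1igj8jiDcO8qMx39WXV/3.jpg",
    "top_p": 1,
    "prompt": "Can you describe this image?",
    "num_beams": 1,
    "max_tokens": 512,
    "temperature": 0.2,
}


def describe_image(model_input=None):
    """Run one prediction and return the model's text output (hypothetical helper)."""
    # Imported lazily so this module loads even without the client installed.
    import replicate

    output = replicate.run(MODEL_REF, input=model_input or EXAMPLE_INPUT)
    # The output schema is a string; join defensively in case the client
    # hands back an iterable of text chunks instead.
    return output if isinstance(output, str) else "".join(str(part) for part in output)
```

Calling `describe_image()` sends the request and consumes account credit, so the sketch defines the function without invoking it. Pinning the version in `MODEL_REF` keeps results tied to the build documented here.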

Input Parameters

image (required) Type: string: Image 1 to discuss
top_p Type: number, Default: 1, Range: 0 - 1: When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens
prompt (required) Type: string: Query to ask the model about the image
num_beams Type: integer, Default: 1, Range: 1 - 5: Number of beams to use when decoding text; higher values are slower but more accurate
max_tokens Type: integer, Default: 512, Range: 1 - ∞: Maximum number of tokens to generate
temperature Type: number, Default: 0.2, Range: 0 - ∞: When decoding text, higher values make the model more creative
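
The defaults and ranges above can be checked client-side before a request is sent. The sketch below is one way to do that, not part of the model's own code: the helper `build_input` and the table `NUMERIC_PARAMS` are hypothetical names, and treating the listed ranges as inclusive is an assumption.

```python
import math

# name -> (integer_only, minimum, maximum, default), taken from the list above.
# math.inf stands in for the unbounded upper limits; bounds assumed inclusive.
NUMERIC_PARAMS = {
    "top_p": (False, 0, 1, 1),
    "num_beams": (True, 1, 5, 1),
    "max_tokens": (True, 1, math.inf, 512),
    "temperature": (False, 0, math.inf, 0.2),
}


def build_input(image, prompt, **overrides):
    """Return an input payload with defaults filled in and ranges checked."""
    if not image or not prompt:
        raise ValueError("image and prompt are required")
    payload = {"image": image, "prompt": prompt}
    for name, (integer_only, low, high, default) in NUMERIC_PARAMS.items():
        value = overrides.pop(name, default)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number")
        if integer_only and not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low} - {high}")
        payload[name] = value
    if overrides:
        raise ValueError(f"unknown parameters: {sorted(overrides)}")
    return payload
```

For example, `build_input("https://example.com/cat.jpg", "What is shown?", num_beams=3)` returns a payload with `num_beams` set to 3 and every other optional parameter at its default, while `num_beams=6` raises a `ValueError`.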

Output Schema

Output

Type: string

Version Details

Version ID: f8a85ee8d0ab7c84433c5f6c00855142edf4783b52a1b66626d47ef260028750
Version Created: March 13, 2024
