cjwbw/pix2struct

6.1K runs · Mar 2023 · Cog 0.6.1 · GitHub · Paper
Tags: image-to-text, ocr, visual-question-answering

About

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Example Output

Output

ash cloud

Performance Metrics

2.62s Prediction Time
2.72s Total Time
All Input Parameters
{
  "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud",
  "image": "https://replicate.delivery/pbxt/IYTqGFhESyTxWX9AgFABgMdfStkVTkDZrNGCqhS6VCijFLgj/ai2d-demo.jpg",
  "model_name": "ai2d"
}
Input Parameters
text (required) · Type: string
The input text: a question for the VQA checkpoints, or conditioning text for the others.
image (required) · Type: string
URL of the input image.
model_name · Default: screen2words
Which fine-tuned Pix2Struct checkpoint to use (e.g. ai2d for diagram question answering, as in the example above; screen2words for screenshot captioning).
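The parameters above map directly onto a Replicate API call. A minimal sketch in Python, assuming the official `replicate` client (`pip install replicate`, with `REPLICATE_API_TOKEN` set); the version hash is the one listed under Version Details below:

```python
# Sketch of calling cjwbw/pix2struct via the Replicate Python client.
# The payload keys mirror the input parameters documented on this page.

MODEL_VERSION = (
    "cjwbw/pix2struct:"
    "e32d77481424b47e7959836638b62082d8528b0c66a3a30eedca3970aaf786e7"
)


def build_input(text: str, image_url: str, model_name: str = "screen2words") -> dict:
    """Assemble the input payload matching this page's parameter schema."""
    return {"text": text, "image": image_url, "model_name": model_name}


def run_pix2struct(text: str, image_url: str, model_name: str = "screen2words") -> str:
    # Imported inside the function so the sketch can be read without the
    # package installed; the call requires network access and an API token.
    import replicate

    return replicate.run(MODEL_VERSION, input=build_input(text, image_url, model_name))


# Reproducing the example input shown on this page:
payload = build_input(
    "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud",
    "https://replicate.delivery/pbxt/IYTqGFhESyTxWX9AgFABgMdfStkVTkDZrNGCqhS6VCijFLgj/ai2d-demo.jpg",
    model_name="ai2d",
)
```

With the example payload above, the model returned the string "ash cloud".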
Output Schema

Output

Type: string

Example Execution Logs
Adding prompt for VQA model: 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
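The right-padding warning comes from the underlying Hugging Face tokenizer. If you run Pix2Struct locally with `transformers` rather than through Replicate, a hedged sketch of the fix the warning suggests (the checkpoint name is an assumption, not stated on this page):

```python
# Sketch: loading a Pix2Struct checkpoint with Hugging Face transformers and
# left-padding the tokenizer, as the log warning recommends for decoder-only
# generation. Checkpoint name is illustrative only.


def load_pix2struct(checkpoint: str = "google/pix2struct-ai2d-base"):
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(checkpoint)
    # Address the "right-padding was detected" warning seen in the logs.
    processor.tokenizer.padding_side = "left"
    model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


# The first log line shows the `text` input being wrapped into a VQA prompt:
question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"
log_line = f"Adding prompt for VQA model: '{question}'"
```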
Version Details
Version ID
e32d77481424b47e7959836638b62082d8528b0c66a3a30eedca3970aaf786e7
Version Created
March 28, 2023