cjwbw/pix2struct 📝🖼️❓ → 📝

▶️ 6.1K runs 📅 Mar 2023 ⚙️ Cog 0.6.1

image-to-text ocr visual-question-answering

About

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Example Output

Output

ash cloud

Performance Metrics

2.62s Prediction Time

2.72s Total Time

All Input Parameters

{
  "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud",
  "image": "https://replicate.delivery/pbxt/IYTqGFhESyTxWX9AgFABgMdfStkVTkDZrNGCqhS6VCijFLgj/ai2d-demo.jpg",
  "model_name": "ai2d"
}
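The `text` value in the example above packs a question and its numbered answer options into a single string. A minimal sketch of how such a prompt could be assembled follows; the helper name `build_choice_prompt` is hypothetical and not part of the model's API, and the "(n) option" layout is simply copied from the example input rather than from any documented requirement.

```python
def build_choice_prompt(question: str, options: list[str]) -> str:
    """Join a question and its answer options into one prompt string.

    Hypothetical helper: the "(1) a (2) b" layout mirrors the example
    input on this page and is an assumption about what the model expects.
    """
    numbered = " ".join(
        f"({index}) {option}" for index, option in enumerate(options, start=1)
    )
    return f"{question} {numbered}"


prompt = build_choice_prompt(
    "What does the label 15 represent?",
    ["lava", "core", "tunnel", "ash cloud"],
)
print(prompt)
```

The resulting string can be passed as the `text` input alongside the image.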

Input Parameters

text (required). Type: string. Input text.
image (required). Type: string. Input image.
model_name. Default: screen2words. Choose a model.
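As a rough illustration of how these parameters fit together, the sketch below builds the input payload and passes it to the Replicate Python client. It assumes the third-party `replicate` package is installed and that a `REPLICATE_API_TOKEN` environment variable is set; the helper names `build_input` and `run_pix2struct` are hypothetical, and the version hash is the one listed under Version Details on this page.

```python
from typing import Optional

# Model reference in "owner/name:version" form, using the version ID
# shown on this page.
MODEL_REF = (
    "cjwbw/pix2struct:"
    "e32d77481424b47e7959836638b62082d8528b0c66a3a30eedca3970aaf786e7"
)


def build_input(text: str, image: str, model_name: Optional[str] = None) -> dict:
    """Assemble the input payload (hypothetical helper).

    `text` and `image` are required. `model_name` is optional; when it is
    omitted, the page lists `screen2words` as the default.
    """
    payload = {"text": text, "image": image}
    if model_name is not None:
        payload["model_name"] = model_name
    return payload


def run_pix2struct(text: str, image: str, model_name: Optional[str] = None):
    """Run a prediction; the output schema on this page says it is a string."""
    # Imported lazily so the payload helper works without the package.
    import replicate  # third-party client, reads REPLICATE_API_TOKEN

    return replicate.run(MODEL_REF, input=build_input(text, image, model_name))
```

Calling `run_pix2struct(...)` with the question, the image URL, and `model_name="ai2d"` from the example input should reproduce the request shown under All Input Parameters.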

Output Schema

Output

Type: string

Example Execution Logs

Adding prompt for VQA model: 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Version Details

Version ID: e32d77481424b47e7959836638b62082d8528b0c66a3a30eedca3970aaf786e7
Version Created: March 28, 2023
