cjwbw/cogagent-chat (inputs: image, text, number → output: text)

2.3K runs · Feb 2024 · Cog 0.8.6
Tags: gui-agent, image-to-text, ocr, visual-understanding

About

A Visual Language Model for GUI Agents

Example Output

The screenshot shows a web page from Skyscanner for booking flights from London (Any) to Bangkok (Any) for 1 adult in Economy class. The page displays options to sort and filter results, with a total of 1009 results shown. Stops are listed on the left with options for direct and 1 stop flights from different airlines. On the right, flight options are presented with details such as departure and arrival times, duration, airlines, whether they have a stopover, and prices in both pounds sterling and Thai baht. The cheapest option is highlighted for £480, with the fastest being £855. Outbound flight options are shown on the left, with return options on the right. User controls for sorting and selecting flights are also visible.

Performance Metrics

Prediction time: 20.13 s
Total time: 283.75 s
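The gap between the two timings is time the run spent outside the model's own prediction step. The page does not break that gap down. Attributing it to queueing and, on a cold start, container boot and weight loading is an assumption. The arithmetic is shown below as a small Python sketch, where `overhead_breakdown` is a hypothetical helper:

```python
# Timing figures reported for the example run (seconds).
PREDICTION_TIME_S = 20.13
TOTAL_TIME_S = 283.75


def overhead_breakdown(prediction_s: float, total_s: float) -> tuple:
    """Return (seconds spent outside prediction, that time as a fraction of the total).

    Hypothetical helper: what the overhead consists of (queueing, cold boot,
    weight loading) is an assumption, not something the page reports.
    """
    if total_s <= 0 or prediction_s > total_s:
        raise ValueError("total time must be positive and at least the prediction time")
    overhead_s = total_s - prediction_s
    return overhead_s, overhead_s / total_s


overhead_s, share = overhead_breakdown(PREDICTION_TIME_S, TOTAL_TIME_S)
print(f"non-prediction time: {overhead_s:.2f} s ({share:.1%} of total)")
```

For this run that is 283.75 − 20.13 = 263.62 s, roughly 93% of the total. Only about 20 s went to generating the answer itself.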
All Input Parameters
{
  "image": "https://replicate.delivery/pbxt/KLeBpmZL2GjRa2c77grUodPILFbYdY8re3AzfuoBmQ3rEH29/Screenshot%202024-02-05%20at%2001.54.19.png",
  "query": "what does this screenshot tell you?",
  "temperature": 0.9
}
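The JSON above is the complete input for the example run. The sketch below assembles the same payload in Python and fills in the defaults listed under Input Parameters for any omitted field. `build_input` is a hypothetical helper, not part of any Replicate library. The non-negative temperature check is also an assumption: the page only says 0 is deterministic.

```python
import json
from typing import Optional

# Defaults as listed under "Input Parameters" on this page.
DEFAULT_QUERY = "Describe this image."
DEFAULT_TEMPERATURE = 0.9


def build_input(image: str, query: Optional[str] = None,
                temperature: Optional[float] = None) -> dict:
    """Assemble a cogagent-chat input dict, filling documented defaults for omitted fields."""
    if not isinstance(image, str) or not image:
        raise ValueError("image is required and must be a non-empty string (e.g. a URL)")
    temp = DEFAULT_TEMPERATURE if temperature is None else float(temperature)
    if temp < 0:
        # Assumption: negative temperatures are treated as invalid.
        raise ValueError("temperature must be non-negative")
    return {
        "image": image,
        "query": DEFAULT_QUERY if query is None else query,
        "temperature": temp,
    }


# Reproduce the example run's inputs; temperature falls back to the 0.9 default.
example = build_input(
    image="https://replicate.delivery/pbxt/KLeBpmZL2GjRa2c77grUodPILFbYdY8re3AzfuoBmQ3rEH29/Screenshot%202024-02-05%20at%2001.54.19.png",
    query="what does this screenshot tell you?",
)
print(json.dumps(example, indent=2))
```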
Input Parameters
image (required). Type: string
Input image.
query. Type: string. Default: "Describe this image."
Question or instruction about the image.
temperature. Type: number. Default: 0.9
Adjusts the randomness of the text output: values greater than 1 produce highly random text, and 0 is deterministic.
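The page does not say how the predictor applies `temperature`. A common technique divides the next-token logits by the temperature before the softmax. The Python sketch below illustrates that generic technique and is not CogAgent's actual decoding code. Treating 0 as greedy (argmax) decoding is an assumption, since dividing by zero is undefined. The `logits` values are made up.

```python
import math
from typing import List


def temperature_softmax(logits: List[float], temperature: float) -> List[float]:
    """Convert raw next-token scores into probabilities scaled by temperature.

    temperature == 0 is treated as greedy decoding: all probability mass goes
    to the highest-scoring token (ties resolve to the first one).
    """
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.9, 2.0):
    print(t, [round(p, 3) for p in temperature_softmax(logits, t)])
```

Lower temperatures concentrate probability on the top token, and higher ones flatten the distribution. This matches the page's description of values above 1 being more random.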
Output Schema

Type: string (the model's text response)

Version Details
Version ID
c3429fc8f69bc71d0ee109e99c62b1e223b7cdfc0839b8e44a4326e12752656e
Version Created
February 5, 2024
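To call this exact version from code, the following is a sketch. It assumes the official `replicate` Python client is installed and that `REPLICATE_API_TOKEN` is set in the environment. `describe_screenshot` is a hypothetical wrapper, and `replicate.run` is the client's function for running a pinned `owner/model:version` reference.

```python
import os

# Pinned version of cjwbw/cogagent-chat, taken from "Version Details".
MODEL_REF = (
    "cjwbw/cogagent-chat:"
    "c3429fc8f69bc71d0ee109e99c62b1e223b7cdfc0839b8e44a4326e12752656e"
)


def describe_screenshot(image_url: str, query: str = "Describe this image.",
                        temperature: float = 0.9) -> str:
    """Run the pinned model version and return its text output.

    Hypothetical wrapper; assumes the `replicate` package and an API token.
    """
    if "REPLICATE_API_TOKEN" not in os.environ:
        raise RuntimeError("set REPLICATE_API_TOKEN before calling the Replicate API")
    import replicate  # imported lazily so this module loads without the client installed

    output = replicate.run(
        MODEL_REF,
        input={"image": image_url, "query": query, "temperature": temperature},
    )
    # The output schema is a plain string; join defensively in case the client yields chunks.
    return output if isinstance(output, str) else "".join(str(part) for part in output)
```

Usage, with a placeholder URL: `print(describe_screenshot("https://example.com/screenshot.png", "what does this screenshot tell you?"))`. Pinning the version hash keeps results tied to this build even if the model is updated later.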