OpenAI releases new model to advance AI agent development

By Mudit Dube

Jan 24, 2025

09:43 am

What's the story

OpenAI has unveiled a research preview of Operator, an AI agent that can perform web-based tasks. The technology behind Operator is Computer-Using Agent (CUA), a model that combines GPT-4o's vision capabilities with advanced reasoning through reinforcement learning. CUA is designed to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields we see on a screen—in a human-like manner. This allows it to perform digital tasks without relying on operating system (OS) or web-specific APIs.

Technological breakthrough

CUA's advanced capabilities and performance

CUA marks a major advance in AI, as it can interpret and interact with GUIs much as a human would. It can break tasks into multi-step plans and adaptively self-correct when it runs into difficulty. It has already set new benchmark results: a 38.1% success rate on OSWorld for full computer-use tasks, and 58.1% on WebArena and 87% on WebVoyager for web-based tasks.

Functionality and security

CUA's operational process and safety measures

CUA works by analyzing raw pixel data to understand what is happening on the screen, then uses a virtual mouse and keyboard to perform actions. It can follow multi-step tasks, handle errors, and adjust to unforeseen changes. The model works in an iterative loop that combines perception, reasoning, and action. OpenAI has also focused on safety in CUA's development, aiming to mitigate the risks of letting an AI agent act in the digital world.
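To make the perception, reasoning, and action loop more concrete, here is a minimal Python sketch. It is only an illustration and not OpenAI's implementation: the `ToyScreen` class, the `Action` type, and the rule-based `decide` function are all hypothetical stand-ins. A real agent would receive actual screenshots and use a model for the reasoning step.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One virtual mouse or keyboard action (hypothetical format)."""
    kind: str          # "click", "type", or "done"
    target: str = ""   # name of the on-screen element to click
    text: str = ""     # text to type


@dataclass
class ToyScreen:
    """Stand-in for a real GUI: one text field and one submit button."""
    field_text: str = ""
    focused: bool = False
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would get raw pixels; this toy returns a small dict.
        return {
            "field_text": self.field_text,
            "focused": self.focused,
            "submitted": self.submitted,
        }

    def apply(self, action: Action) -> None:
        # Update the simulated GUI in response to a mouse or keyboard action.
        if action.kind == "click" and action.target == "field":
            self.focused = True
        elif action.kind == "type" and self.focused:
            self.field_text += action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = bool(self.field_text)
            self.focused = False


def decide(observation: dict, goal_text: str) -> Action:
    """Rule-based placeholder for the model's reasoning step (assumption)."""
    if observation["submitted"]:
        return Action("done")
    if observation["field_text"] == goal_text:
        return Action("click", target="submit")
    if not observation["focused"]:
        return Action("click", target="field")
    return Action("type", text=goal_text)


def run_agent(screen: ToyScreen, goal_text: str, max_steps: int = 10) -> list[Action]:
    """Repeat perceive -> reason -> act until done or out of steps."""
    history = []
    for _ in range(max_steps):
        observation = screen.screenshot()        # perception
        action = decide(observation, goal_text)  # reasoning
        history.append(action)
        if action.kind == "done":
            break
        screen.apply(action)                     # action
    return history


if __name__ == "__main__":
    screen = ToyScreen()
    for step in run_agent(screen, "hello"):
        print(step)
```

Each pass through the loop starts from a fresh observation of the screen, so the agent reacts to whatever state the interface is actually in rather than to what it expected. This is one simple way to picture how such an agent could recover from errors or unexpected changes.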

Benchmark results

CUA's performance evaluation and future improvements

CUA has set new top results on both computer-use and browser-use benchmarks, using the same universal interface of screen, mouse, and keyboard. It registered a 58.1% success rate on WebArena and an 87% success rate on WebVoyager for web-based tasks. However, despite its high success rate on simpler tasks like those in WebVoyager, CUA still needs improvement to match human performance on more complex benchmarks like WebArena.