Glossary term
Evaluation and Benchmarks
Benchmark evaluating desktop computer-use agents on realistic cross-application tasks across Windows, macOS, and Ubuntu environments.
OSWorld (2024) contains 369 tasks spanning file management, browsers, spreadsheets, code editors, and multi-app workflows. Claude 3.5 Sonnet's computer use achieved 14.9% task success in the screenshot-only setting, versus 72.4% for humans, a gap that made OSWorld a common reference point for desktop agent evaluation.
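The success rates quoted above are the fraction of tasks completed. A minimal sketch of that arithmetic follows, assuming each task yields a simple pass/fail outcome; the `success_rate` helper and the count of 55 passing tasks are illustrative assumptions chosen to round to 14.9%, not figures reported by the benchmark.

```python
from typing import Iterable


def success_rate(results: Iterable[bool]) -> float:
    """Return the fraction of tasks marked successful."""
    results = list(results)
    if not results:
        raise ValueError("no task results to score")
    return sum(results) / len(results)


TOTAL_TASKS = 369          # task count stated for OSWorld
ASSUMED_SUCCESSES = 55     # hypothetical count, for illustration only

# Build a pass/fail list: 55 passes followed by 314 failures.
outcomes = [True] * ASSUMED_SUCCESSES + [False] * (TOTAL_TASKS - ASSUMED_SUCCESSES)
rate = success_rate(outcomes)
print(f"{rate:.1%}")  # 55 / 369 rounds to 14.9%
```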
Microsoft's UFO agent, designed for Windows desktop automation, is evaluated OSWorld-style on tasks in Word, Excel, PowerPoint, and Edge. UFO relies on GPT-4V to interpret screenshots and on the Windows UI Automation API to interact with application elements.
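The screenshot-then-act pattern described above can be sketched as a simple loop. This is a hypothetical outline, not UFO's actual code: `Action`, `run_agent`, and the `capture`, `decide`, and `execute` callbacks are names invented for illustration, standing in for screenshot capture, the vision-language model call, and the UI automation layer respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # e.g. "click", "type", or "done"
    target: str = ""   # hypothetical UI element name
    text: str = ""     # text to enter for "type" actions


def run_agent(
    task: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 15,
) -> List[Action]:
    """Observe the screen, ask the model for an action, apply it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                     # observe
        action = decide(task, screenshot, history) # model picks next step
        history.append(action)
        if action.kind == "done":                  # model signals completion
            break
        execute(action)                            # act on the application
    return history


# Usage with stand-in callbacks: a scripted "model" and a recording executor.
script = iter([
    Action("click", target="File menu"),
    Action("type", text="report.docx"),
    Action("done"),
])
executed: List[Action] = []
history = run_agent(
    task="save the document",
    capture=lambda: b"",
    decide=lambda task, shot, hist: next(script),
    execute=executed.append,
)
```

Passing the three stages in as callbacks keeps the loop testable without a real desktop; a benchmark harness would then judge success from the final application state rather than from the action history.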
Adept AI evaluates its ACT-1 agent on OSWorld-equivalent enterprise software tasks (Salesforce, Workday, SAP) as part of its commercial product validation, demonstrating the benchmark's relevance beyond academic research.