Glossary term
Evaluation and Benchmarks
Benchmark providing self-hosted functional replicas of real websites to evaluate autonomous web agents on multi-step realistic tasks.
WebArena (ICLR 2024 Oral) contains 812 tasks across four web domains (e-commerce, social forums, collaborative software development, and content management). At publication, the best GPT-4-based agent reached an end-to-end task success rate of only 14.41%, versus 78.24% for humans, establishing a large performance gap for autonomous web agents.
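As a rough illustration of how such headline numbers arise, the sketch below computes a success rate as the share of tasks whose final state passes a functional-correctness check, then the gap to the human baseline. The helper `success_rate` and the count of 117 passing tasks are hypothetical, chosen only because 117 of 812 rounds to the reported 14.41%; the 78.24% human figure is entered directly as the reported value rather than recomputed.

```python
def success_rate(outcomes):
    """Return the percentage of tasks marked as passed (True)."""
    if not outcomes:
        raise ValueError("no task outcomes given")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical per-task outcomes: 117 passes out of 812 tasks.
# The count 117 is an assumption for illustration, not a figure from the paper.
TOTAL_TASKS = 812
ASSUMED_PASSES = 117
outcomes = [True] * ASSUMED_PASSES + [False] * (TOTAL_TASKS - ASSUMED_PASSES)

agent_rate = success_rate(outcomes)
human_rate = 78.24  # reported human success rate, in percent
gap = human_rate - agent_rate

print(f"agent {agent_rate:.2f}%, human {human_rate:.2f}%, gap {gap:.2f} points")
```

The gap is expressed in percentage points, not as a ratio, which is the usual way benchmark comparisons of this kind are reported.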
OpAgent (2026) achieved a new state-of-the-art of 71.6% on WebArena by combining multi-task SFT, online agentic RL, and a modular multi-agent architecture fine-tuned on Qwen3-VL-32B.
ServiceNow uses WebArena-style evaluation in its browser-automation research. WorkArena, its enterprise-focused counterpart to WebArena, evaluates agents on 33 task types (19,912 unique task instances) across ServiceNow workflows, using the open-source BrowserGym evaluation platform.