Glossary term
Evaluation and Benchmarks
Benchmark evaluating LLM agents on consequential real-world professional tasks in a simulated software company environment.
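TheAgentCompany (Carnegie Mellon University and collaborators, 2024) contains 175 tasks modelling realistic work at a software company, spanning software engineering, project management, HR, and finance. Agents carry out the tasks through self-hosted internal tools: GitLab (code hosting), Plane (project management), RocketChat (chat with simulated colleagues), and ownCloud (file storage). In the original results, the best agent (Claude 3.5 Sonnet) fully completed only 24% of tasks.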
TheAgentCompany tasks require agents to work across several internal tools within a single task, communicate with simulated colleagues, and complete long-horizon, multi-step workflows, reflecting the complexity of real agentic deployments that single-tool benchmarks do not capture.
Enterprise AI teams can use TheAgentCompany's task taxonomy to scope their own agent pilots, mapping its HR, engineering, and finance task categories to internal processes to identify which workflows are ready for agentic automation.