Glossary term
Evaluation and Benchmarks
Benchmark evaluating LLMs as agents across 8 distinct environments including operating systems, databases, web browsers, and games.
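AgentBench (Liu et al. 2023) covers OS shell tasks, database management, knowledge graph querying, web shopping, web browsing, digital card games, lateral thinking puzzles, and household tasks. Its overall score is a weighted average across environments rather than a value on a fixed 5-point scale: GPT-4 leads at roughly 4.0, while Llama 2 70B scores about 0.8, showing the large gap between proprietary and open-source agents.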
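To make the aggregation concrete, here is a minimal sketch of how an overall score can be built from per-environment scores, assuming each environment is weighted by the reciprocal of the average score all evaluated models reach on it. The function name `overall_scores`, the model names, and every number below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from statistics import mean


def overall_scores(raw):
    """Map {model: {task: raw score}} to {model: overall score}.

    Assumption: each task is normalized by the mean score of all models on
    that task, so an "average" model ends up with an overall score of 1.0.
    """
    tasks = sorted({task for scores in raw.values() for task in scores})
    # Per-task average across models; its reciprocal acts as the task weight.
    task_avg = {t: mean(raw[m][t] for m in raw) for t in tasks}
    return {m: mean(raw[m][t] / task_avg[t] for t in tasks) for m in raw}


# Made-up scores for two hypothetical models on two environments.
raw = {
    "model_a": {"os": 40.0, "db": 30.0},
    "model_b": {"os": 10.0, "db": 10.0},
}
scores = overall_scores(raw)
print(scores)
```

Under this normalization, a harder environment (low average score) does not get drowned out by an easier one, which is why the result is not bounded by a fixed maximum. The sketch assumes every task has a nonzero average; a real implementation would need to guard against division by zero.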
The OS environment in AgentBench tests bash command execution, file management, and process control. Researchers use it to show that most open-source LLMs of 70B parameters or fewer struggle with multi-step OS automation tasks.
AgentBench's multi-turn task design, in which the model alternates actions with environment feedback over many rounds, is used by agent framework developers to stress-test state management, error recovery, and long-horizon planning in tools like LangChain and AutoGen.
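The multi-turn pattern can be illustrated with a small sketch of an agent and environment loop. Everything here is hypothetical: `ToyFileEnv`, `scripted_agent`, and `run_episode` are illustrative stand-ins, not AgentBench, LangChain, or AutoGen code, and the scripted policy replaces what would normally be an LLM call. The agent makes a deliberate typo on the first turn so the loop exercises error recovery.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileEnv:
    """Hypothetical multi-turn environment that keeps state between turns."""
    files: set = field(default_factory=set)
    max_turns: int = 5

    def step(self, action: str):
        """Apply one action and return (observation, done)."""
        parts = action.split()
        if len(parts) == 2 and parts[0] == "touch":
            self.files.add(parts[1])
            return "ok", False
        if parts == ["ls"]:
            return " ".join(sorted(self.files)), False
        if parts == ["submit"]:
            return "submitted", True
        return f"error: unknown command {action!r}", False


def scripted_agent(history):
    """Hypothetical policy: recover from an error, verify, then submit."""
    if not history:
        return "tuch a.txt"  # deliberate typo to trigger an error
    last_action, last_obs = history[-1]
    if last_obs.startswith("error"):
        return "touch a.txt"  # retry with the corrected command
    if last_action.startswith("touch"):
        return "ls"  # check that the state change took effect
    return "submit"


def run_episode(env, agent):
    """Alternate agent actions and environment feedback until done."""
    history = []
    for _ in range(env.max_turns):
        action = agent(history)
        observation, done = env.step(action)
        history.append((action, observation))
        if done:
            break
    success = "a.txt" in env.files and history[-1][1] == "submitted"
    return success, history


success, history = run_episode(ToyFileEnv(), scripted_agent)
for action, observation in history:
    print(f"{action!r} -> {observation!r}")
```

Because the agent only sees the accumulated history, any framework wrapping it has to carry state across turns, surface error observations back to the model, and respect a turn budget, which are the behaviors this style of benchmark puts under pressure.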