Glossary term
Evaluation and Benchmarks
Benchmark for General AI Assistants featuring conceptually simple real-world questions requiring reasoning, web browsing, multi-modality, and tool use.
GAIA (ICLR 2024) contains 466 questions. Human respondents score 92%, whereas GPT-4 with plugins scored only 15% at launch. The questions are conceptually simple for humans, typically taking a few steps each, but hard for AI systems because they demand correct tool use chained across multiple reasoning steps.
By 2025, the top agents on the GAIA leaderboard reached approximately 75%, still well below the 92% human baseline. Claude Sonnet 4.5 inside a well-engineered agent framework ranks among the strongest GAIA performers according to the awesomeagents.ai leaderboard (May 2026).
Enterprise AI teams use GAIA Level 1 tasks, which need only a few steps and at most basic tool use such as web browsing, as a quick sanity check for new agent deployments. Level 3 tasks, which require long multi-step sequences across multiple tools, serve as aspirational capability targets.
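A Level 1 sanity check of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GAIA's official scorer: the record schema (`level`, `expected`, `predicted`), the `normalize` helper, the 0.8 pass threshold, and the toy records are all hypothetical, and the answer matching is a simplified case-insensitive comparison.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (simplified matching, an assumption)."""
    return " ".join(answer.strip().lower().split())


def accuracy_by_level(records):
    """Return {level: fraction correct} for records in a hypothetical schema.

    Each record is a dict with 'level', 'expected', and 'predicted' keys.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["level"]] += 1
        if normalize(record["predicted"]) == normalize(record["expected"]):
            correct[record["level"]] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


def passes_sanity_check(records, level=1, threshold=0.8):
    """True if accuracy on the given level meets the (hypothetical) threshold."""
    scores = accuracy_by_level(records)
    return level in scores and scores[level] >= threshold


# Toy records for illustration only; these are not real GAIA questions or results.
records = [
    {"level": 1, "expected": "Paris", "predicted": "paris"},
    {"level": 1, "expected": "42", "predicted": "42"},
    {"level": 1, "expected": "blue", "predicted": "green"},
    {"level": 3, "expected": "17", "predicted": "19"},
]

print(accuracy_by_level(records))
print(passes_sanity_check(records, level=1, threshold=0.6))
```

Reporting accuracy per level keeps the two uses separate: the Level 1 figure gates a deployment, while the Level 3 figure is tracked over time as a capability target.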