Glossary term
Evaluation and Benchmarks
The AI2 Reasoning Challenge, a benchmark from the Allen Institute for AI of multiple-choice grade-school science questions that are meant to require genuine reasoning rather than memorisation.
ARC Challenge (Clark et al. 2018) is the harder partition of the benchmark, with a test set of 1,172 questions that simple retrieval and word co-occurrence methods answer incorrectly - used to test whether models reason or merely memorise, with reported scores of 96.3% for GPT-4 and 83.4% for Llama 3.1 8B.
The original Open LLM Leaderboard (v1) used ARC-Challenge as one of its six standard benchmarks - developers comparing quantised models (GGUF, GPTQ) use it to verify that quantisation doesn't degrade reasoning.
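ARC-AGI (François Chollet; introduced in 2019 as the Abstraction and Reasoning Corpus and rebranded as ARC-AGI in 2024) is a separate benchmark of visual analogy puzzles that shares only the acronym with the AI2 ARC, and GPT-4 scores only 5% on it - used as a hard benchmark to measure progress toward general reasoning beyond pattern matching.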
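The quantisation check described above comes down to comparing multiple-choice accuracy for two sets of predictions on the same questions. The following is a minimal sketch of that comparison, not any leaderboard's actual harness: the function names `accuracy` and `accuracy_drop` and the five-question toy data are hypothetical, and a real run would use the model's predicted answer letters for the ARC-Challenge questions.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted label matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty answer key")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(
    baseline_preds: Sequence[str],
    quantised_preds: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Baseline accuracy minus quantised accuracy (positive means degradation)."""
    return accuracy(baseline_preds, answers) - accuracy(quantised_preds, answers)


# Hypothetical toy data: answer key and two models' predicted labels.
answers = ["A", "C", "B", "D", "A"]
baseline_preds = ["A", "C", "B", "D", "B"]   # full-precision model
quantised_preds = ["A", "C", "D", "D", "B"]  # quantised model (e.g. GGUF or GPTQ)

drop = accuracy_drop(baseline_preds, quantised_preds, answers)
print(f"baseline:  {accuracy(baseline_preds, answers):.1%}")
print(f"quantised: {accuracy(quantised_preds, answers):.1%}")
print(f"drop:      {drop:.1%}")
```

On a benchmark of roughly a thousand questions, a difference of a fraction of a percentage point is within sampling noise, so a small drop on its own is weak evidence that quantisation harmed reasoning.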