Glossary term
Evaluation and Benchmarks
Commonsense inference benchmark requiring models to select the most plausible continuation of a scenario description.
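The selection step described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's official scoring code: the names `pick_ending`, `accuracy`, and `toy_logprob` are hypothetical, the example item is made up, and dividing by word count is an assumed stand-in for the length normalization that evaluation harnesses often apply. A real evaluation would replace `toy_logprob` with summed token log-probabilities from the model under test.

```python
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[str, str], float]


def pick_ending(context: str, endings: Sequence[str], logprob: Scorer) -> int:
    """Return the index of the ending with the highest length-normalized score."""
    scores = []
    for ending in endings:
        n_words = max(len(ending.split()), 1)  # avoid division by zero
        scores.append(logprob(context, ending) / n_words)
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(items: List[Dict], logprob: Scorer) -> float:
    """Fraction of items where the chosen ending matches the gold label."""
    correct = sum(
        pick_ending(item["ctx"], item["endings"], logprob) == item["label"]
        for item in items
    )
    return correct / len(items)


def toy_logprob(context: str, ending: str) -> float:
    """Hypothetical stand-in for a model: words already seen in the
    context are treated as likelier (-1.0) than unseen words (-3.0)."""
    seen = {w.strip(".,").lower() for w in context.split()}
    return sum(
        -1.0 if w.strip(".,").lower() in seen else -3.0
        for w in ending.split()
    )


# Made-up item in the benchmark's general shape: one context, four endings.
items = [
    {
        "ctx": "A woman fills the bucket with water and picks up the sponge. She",
        "endings": [
            "throws the bucket at a passing airplane.",
            "dips the sponge in the water and scrubs the bucket.",
            "eats a sandwich on the roof.",
            "starts singing loudly in a theater.",
        ],
        "label": 1,
    }
]

chosen = pick_ending(items[0]["ctx"], items[0]["endings"], toy_logprob)
score = accuracy(items, toy_logprob)
print(chosen, score)
```

Normalizing by length keeps longer endings from being penalized simply for containing more tokens, which is why reported scores can differ depending on whether raw or normalized accuracy is used.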
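HellaSwag (Zellers et al. 2019) is easy for humans (about 95.6% accuracy) but was built with adversarial filtering to fool the models of its time. GPT-4 is reported at 95.3% (10-shot), so frontier LLMs now match human accuracy on this benchmark. That suggests the benchmark is close to saturated; it does not by itself show that LLMs match human commonsense reasoning in general.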
HellaSwag was one of the 6 core benchmarks in the original Open LLM Leaderboard (Hugging Face), where the open-source community used it to compare Llama, Mistral, Phi, and Falcon variants. The leaderboard's 2024 revision replaced that benchmark set.
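Some enterprise model selection teams use HellaSwag as a rough proxy for response coherence in conversational agents, although it measures commonsense sentence completion rather than instruction following. A cutoff such as 85% is a rule of thumb, not a validated threshold: models scoring below it are said to show poor coherence in multi-turn customer service deployments.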