Glossary term
Evaluation and Benchmarks
Benchmark of 164 Python programming problems used to evaluate code generation capability.
HumanEval (Chen et al., 2021, OpenAI) is usually reported as pass@1, the share of problems solved by a single generated sample. GPT-4o is reported at 90.2% pass@1, versus 67.0% for GPT-3.5, and the GitHub Copilot team has used such scores to justify model upgrades.
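As a rough illustration of how a pass@k score is computed, the sketch below implements the unbiased estimator described in Chen et al. (2021): for a problem with n generated samples of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are illustrative assumptions, not part of any official harness.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: number of samples the user is assumed to draw.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)


# Hypothetical results for three problems: (samples generated, samples passing).
example_results = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(example_results, k=1))
```

With k = 1 the estimator reduces to c / n per problem, so a benchmark-level pass@1 is simply the mean fraction of passing samples across all problems.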
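Devin (Cognition AI) resolves 13.86% of issues on SWE-bench, a separate and harder benchmark that asks a model to fix real GitHub issues in existing repositories rather than write standalone functions as in HumanEval. Enterprise buyers use SWE-bench to compare autonomous coding agent products.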
HumanEval+ (EvalPlus) extends the original benchmark with 80x more test cases per problem to reduce false positives. Research teams use it to measure true code correctness rather than the ability to pass superficial tests.
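The toy sketch below shows why additional test cases reduce false positives: a buggy candidate can pass a small base test set and still fail once edge cases are added. The task, the `buggy_max` candidate, and both test lists are hypothetical examples made up for illustration; they are not taken from HumanEval or HumanEval+.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[List[int], int]  # (input list, expected maximum)


def buggy_max(nums: List[int]) -> int:
    """Hypothetical model output: wrong when every element is negative."""
    best = 0  # bug: should start from nums[0]
    for x in nums:
        if x > best:
            best = x
    return best


def passes(candidate: Callable[[List[int]], int], tests: Sequence[TestCase]) -> bool:
    """Return True only if the candidate matches every expected output."""
    return all(candidate(list(args)) == expected for args, expected in tests)


# A small base suite, in the spirit of the original benchmark's few tests per problem.
base_tests: List[TestCase] = [([1, 2, 3], 3), ([5, 1], 5)]
# Extra edge cases, in the spirit of an extended suite.
extra_tests: List[TestCase] = [([-3, -1], -1), ([-7], -7)]

print(passes(buggy_max, base_tests))                # True: a false positive
print(passes(buggy_max, base_tests + extra_tests))  # False: the bug is caught
```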