Glossary term
Evaluation and Benchmarks
Benchmark of real GitHub issues requiring autonomous end-to-end software engineering to resolve.
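SWE-bench (Princeton/University of Chicago, 2023) contains 2,294 task instances, each pairing a real GitHub issue with the pull request that fixed it, drawn from 12 popular Python repositories. It is widely used to rank coding agents: Devin (Cognition) reported 13.86% of issues resolved on a sampled subset, and Claude 3.5 Sonnet reported 49% with scaffolding on the SWE-bench Verified subset.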
SWE-bench Verified (2024) narrows the benchmark to 500 human-validated issues where the test suite reliably catches the fix. It is used, including by enterprise buyers, to compare autonomous coding agents on realistic software maintenance tasks.
Cosine AI and SWE-agent (Princeton) treat SWE-bench as a primary development and evaluation target for their coding agents. Fine-tuning on SWE-bench-style trajectories is reported to roughly double pass rate compared to base LLM baselines.
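The percentages quoted in this entry are resolve rates: the share of task instances whose tests pass after the agent's patch is applied. A minimal sketch of that arithmetic follows, assuming per-test outcomes have already been collected by some harness. The `InstanceResult` class, the helper names, and the sample data are hypothetical illustrations rather than the official evaluation code. The `fail_to_pass` and `pass_to_pass` fields mirror the dataset's `FAIL_TO_PASS` and `PASS_TO_PASS` test lists.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical container for one task instance's test outcomes."""
    instance_id: str
    # Tests that fail before the fix and should pass after it.
    fail_to_pass: dict[str, bool]
    # Tests that pass before the fix and must keep passing.
    pass_to_pass: dict[str, bool]


def is_resolved(result: InstanceResult) -> bool:
    """An instance counts as resolved only if every target test now passes
    and no previously passing test regressed."""
    if not result.fail_to_pass:
        # Assumption: with no target tests, the fix cannot be demonstrated.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved, the figure leaderboards report."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


# Made-up sample data; instance ids and test names are placeholders.
sample = [
    InstanceResult("repo-a-1", {"test_fix": True}, {"test_old": True}),
    InstanceResult("repo-a-2", {"test_fix": False}, {"test_old": True}),
    InstanceResult("repo-b-1", {"test_fix": True}, {"test_old": False}),
    InstanceResult("repo-b-2", {"test_x": True, "test_y": True}, {}),
]

print(f"{resolve_rate(sample):.1f}%")  # prints "50.0%"
```

The second sample instance fails because the target test still fails, and the third fails because the patch broke a previously passing test, which is why an agent cannot score well by simply editing code until one test turns green.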