SWE-bench

Benchmark of real GitHub issues requiring autonomous end-to-end software engineering to resolve.

1.
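SWE-bench (Princeton/Chicago, 2023) contains 2,294 task instances built from real GitHub issues and their fixing pull requests across 12 popular Python repositories, and is used to rank coding agents. Devin (Cognition) reported 13.86% resolved on a random 25% subset of the test set; the 49% figure for Claude 3.5 Sonnet with scaffolding was reported on SWE-bench Verified, not on the full benchmark.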
2.
SWE-bench Verified (2024) is a 500-instance subset in which human annotators confirmed that each issue is well specified and that the test suite reliably accepts a correct fix. Enterprise buyers use it to compare autonomous coding agents on realistic software maintenance tasks.
3.
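Cosine AI and SWE-agent (Princeton) treat SWE-bench as a primary target for their coding agents. SWE-agent is a scaffold around an existing LLM rather than a model trained on the benchmark; fine-tuning on SWE-bench-style trajectories is reported to roughly double pass rate over base-LLM baselines.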
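
The percentages quoted above are resolved rates: the share of task instances where the agent's patch makes the target tests pass without breaking others. A minimal sketch of that scoring rule follows, assuming each instance carries two test groups analogous to the dataset's FAIL_TO_PASS and PASS_TO_PASS fields. The names `InstanceResult`, `is_resolved`, and `resolved_rate` are hypothetical and are not part of the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one task instance after applying a candidate patch.

    Hypothetical container; the two dicts map test name -> passed?
    """
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that fail before the fix and should pass after
    pass_to_pass: dict[str, bool]  # tests that passed before and must keep passing


def is_resolved(result: InstanceResult) -> bool:
    # Assumption: an instance with no fail-to-pass tests cannot count as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for four instances; only the first is fully resolved.
example = [
    InstanceResult("repo__a-1", {"test_bug": True}, {"test_old": True}),
    InstanceResult("repo__a-2", {"test_bug": False}, {"test_old": True}),   # fix did not work
    InstanceResult("repo__b-3", {"test_bug": True}, {"test_old": False}),   # fix broke a regression test
    InstanceResult("repo__b-4", {"test_bug": False}, {"test_old": False}),
]
print(resolved_rate(example))  # 25.0
```

Because the rule is all-or-nothing per instance, a patch that fixes the reported bug but breaks one previously passing test scores the same as no patch at all.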
