SWE-Bench
SWE-Bench is a software engineering evaluation dataset introduced by Princeton researchers in 2023. It contains 2,294 real GitHub issues drawn from 12 open-source Python repositories, each paired with the repository state and the ground-truth patch that resolved the issue. A model is evaluated on whether it can autonomously generate a patch that makes the relevant tests pass, so the benchmark reflects realistic repository-level coding rather than isolated function generation.
How it works
Each SWE-Bench instance consists of the repository at a specific commit, the issue description, and the set of test cases that the correct fix must pass. The model sees the repository and the issue text; the tests are held out and used only for evaluation. The model, often running as an agent with file-reading and editing tools, must identify the relevant code, understand the bug, and produce a correct patch. Evaluation applies the patch, runs the test cases against the patched code, and reports the percentage of issues resolved.
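The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the official SWE-Bench harness: the `InstanceResult` type, its field names, and the sample instance IDs and test names are all hypothetical. It assumes the tests have already been run against each patched repository, and only shows how per-instance test outcomes roll up into a resolved percentage.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of running one instance's tests on a model patch (hypothetical type)."""
    instance_id: str
    required_tests: list[str]  # tests the correct fix must pass
    passed_tests: set[str]     # tests that passed after applying the model's patch


def is_resolved(result: InstanceResult) -> bool:
    # An instance counts as resolved only if every required test passed.
    return all(t in result.passed_tests for t in result.required_tests)


def resolved_rate(results: list[InstanceResult]) -> float:
    # Percentage of instances resolved; 0.0 for an empty run.
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up results for four instances, purely for illustration.
results = [
    InstanceResult("repo-a-1", ["test_x", "test_y"], {"test_x", "test_y"}),
    InstanceResult("repo-a-2", ["test_z"], set()),
    InstanceResult("repo-b-1", ["test_q"], {"test_q", "test_other"}),
    InstanceResult("repo-b-2", ["test_r", "test_s"], {"test_r"}),
]
print(f"{resolved_rate(results):.1f}% resolved")  # prints "50.0% resolved"
```

Scoring is all-or-nothing per instance: a patch that passes only some of the required tests, as in the last sample instance, earns no partial credit.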
Key facts
- SWE-Bench Verified: A human-filtered subset of 500 problems with validated test coverage, used as the basis for the primary leaderboard.
- Top scores (2025): Leading models and agent scaffolds resolve 40 to 55 percent of SWE-Bench Verified problems.
- Difficulty: Problems require understanding codebases of thousands of lines, cross-file edits, and multi-step reasoning.
- Agent-required: High scores require agentic scaffolding with tool access; base model completions score near zero.
For builders
SWE-Bench is a more realistic signal for AI coding assistant capability than HumanEval for teams building engineering automation products. Monitoring frontier model score trends on SWE-Bench provides a leading indicator for when autonomous code review, bug triage, or PR generation features become viable at the quality bar a product requires.
Sources
- Jimenez, C., et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770. arxiv.org
- Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv:2107.03374. arxiv.org
- Hendrycks, D., et al. (2020). Measuring Massive Multitask Language Understanding (MMLU). arXiv:2009.03300. arxiv.org
- Stanford CRFM. Holistic Evaluation of Language Models (HELM). crfm.stanford.edu
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. arxiv.org