SWE-Bench

What to know

  • SWE-Bench is a software engineering evaluation dataset introduced by Princeton researchers in 2023, containing over 2,000 real GitHub issues paired with the repository state and the ground-truth patch that resolved each issue.
  • Each SWE-Bench instance supplies the repository at a specific commit, the issue description, and the set of test cases that the correct fix must pass.
  • SWE-Bench is a more realistic signal for AI coding assistant capability than HumanEval for teams building engineering automation products.

SWE-Bench is a software engineering evaluation dataset introduced by Princeton researchers in 2023. It contains 2,294 real GitHub issues drawn from 12 popular open-source Python repositories, each paired with the repository state at the time and the ground-truth patch that resolved the issue. A model is evaluated on whether it can autonomously generate a patch that makes the relevant tests pass, which reflects realistic repository-level coding rather than isolated function generation.

How it works

Each SWE-Bench instance contains the repository at a specific commit, the issue description, and the set of test cases that a correct fix must pass. The model sees the repository and the issue but not the tests. Often running as an agent with file-reading and editing tools, it must locate the relevant code, understand the bug, and produce a patch. Evaluation applies the patch and runs the tests: the tests that failed before the fix must now pass, and the tests that already passed must not break. The reported score is the percentage of issues resolved.
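The resolution rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the `Instance` class, the `is_resolved` and `resolve_rate` helpers, and the demo task IDs and test names are all hypothetical. The two test lists loosely mirror the fail-to-pass and pass-to-pass test sets recorded for each task, and the sketch assumes the tests have already been run elsewhere, so it only scores their outcomes.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Simplified stand-in for one benchmark task (hypothetical structure)."""
    instance_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that already pass and must not regress


def is_resolved(instance: Instance, passed_tests: set[str]) -> bool:
    """An issue counts as resolved only if every required test passed."""
    required = instance.fail_to_pass + instance.pass_to_pass
    return all(test in passed_tests for test in required)


def resolve_rate(instances: list[Instance], results: dict[str, set[str]]) -> float:
    """Percentage of instances resolved; a missing result counts as unresolved."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(inst, results.get(inst.instance_id, set()))
        for inst in instances
    )
    return 100.0 * resolved / len(instances)


# Made-up demo data: the second patch fixes the bug but breaks an existing test.
tasks = [
    Instance("demo-1", ["test_bug"], ["test_existing"]),
    Instance("demo-2", ["test_crash"], ["test_api"]),
]
outcomes = {
    "demo-1": {"test_bug", "test_existing"},
    "demo-2": {"test_crash"},
}
print(resolve_rate(tasks, outcomes))  # 50.0
```

Scoring is all-or-nothing per issue in this sketch: a patch that fixes the reported bug but breaks a previously passing test is counted as unresolved, which is why the second demo task does not contribute to the rate.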

Key facts

  • SWE-Bench Verified: A human-filtered subset of 500 problems with validated test coverage, used as the primary leaderboard.
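  • Top scores (2025): Leading models and agent scaffolds resolved roughly 40 to 55 percent of SWE-Bench Verified problems at the start of 2025, and reported results climbed above 70 percent later that year.
  • Difficulty: Problems require navigating real codebases that often run to hundreds of thousands of lines, making cross-file edits, and reasoning over multiple steps.
  • Agent-required: High scores require agentic scaffolding with tool access; in the original paper, models prompted to emit a patch in a single pass resolved only a few percent of issues.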

For builders

For teams building engineering automation products, SWE-Bench is a more realistic signal of AI coding assistant capability than HumanEval, which tests isolated function generation. Tracking how frontier model scores on SWE-Bench change over time gives a leading indicator of when autonomous code review, bug triage, or PR generation features will reach the quality bar a product requires.
