SWE-bench: Evaluating AI on Real-World Software Engineering
SWE-bench is a benchmark designed to evaluate whether AI models can solve real-world software engineering tasks. Rather than testing code generation in isolation, SWE-bench presents models with actual GitHub issues from popular open-source Python repositories and asks them to produce a patch that resolves the issue and passes the associated test suite.
The benchmark was introduced in the paper SWE-bench: Can Language Models Resolve Real-World GitHub Issues? by Carlos E. Jimenez et al. at Princeton University, first released in late 2023 and published at ICLR 2024.
How it works
Each SWE-bench task consists of:
- A GitHub issue description — the natural-language problem statement as written by the original issue author.
- A codebase snapshot — the state of the repository at the time the issue was filed.
- A gold patch and test suite — the human-authored fix and its associated tests. The model's patch is evaluated by checking that the tests the gold patch was written to fix now pass, and that previously passing tests do not regress.
Models are scored on % resolved — the fraction of issues where the generated patch passes the full test suite. This makes SWE-bench more rigorous than benchmarks that only check if code compiles or passes a single test case.
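The resolved-rate metric can be sketched in a few lines. The field names below are illustrative stand-ins, not the official harness schema, though the two test categories (tests the fix must make pass, and existing tests that must not regress) mirror how SWE-bench grades patches:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one SWE-bench task for a generated patch.
    Field names are illustrative, not the official harness schema."""
    instance_id: str
    patch_applied: bool
    fail_to_pass_ok: bool   # tests the gold patch fixes must now pass
    pass_to_pass_ok: bool   # previously passing tests must not regress

    @property
    def resolved(self) -> bool:
        return self.patch_applied and self.fail_to_pass_ok and self.pass_to_pass_ok

def percent_resolved(results: list[TaskResult]) -> float:
    """Fraction of tasks where the generated patch passes the full test suite."""
    if not results:
        return 0.0
    return 100.0 * sum(r.resolved for r in results) / len(results)

# Hypothetical results for three tasks (instance IDs follow the repo__repo-issue format).
results = [
    TaskResult("django__django-11001", True, True, True),
    TaskResult("sympy__sympy-13480", True, True, False),    # regression: not resolved
    TaskResult("scikit-learn__scikit-learn-10297", False, False, False),
]
print(f"{percent_resolved(results):.1f}% resolved")  # 33.3% resolved
```

Note that a patch which fixes the reported bug but breaks an unrelated test still counts as unresolved, which is part of what makes the metric strict.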
SWE-bench Verified
The original SWE-bench dataset contains 2,294 tasks, but not all of them are well-specified or reliably solvable. To address this, OpenAI collaborated with the SWE-bench team to create SWE-bench Verified — a human-filtered subset of 500 tasks where annotators confirmed that:
- The issue description contains enough information to identify the problem.
- The test suite reliably validates correct solutions.
- The task is not ambiguous or under-specified.
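Conceptually, the Verified subset is the result of keeping only tasks whose human annotations pass all three checks. A minimal sketch of that filter, with hypothetical annotation fields (the real review data uses a different schema):

```python
def is_verified(task: dict) -> bool:
    """Keep a task only if annotators confirmed all three criteria.
    The keys here are hypothetical stand-ins for the real annotation schema."""
    return (
        task["issue_is_specific"]        # enough information to identify the problem
        and task["tests_are_reliable"]   # test suite validates correct solutions
        and not task["is_ambiguous"]     # task is well-specified
    )

tasks = [
    {"id": "a", "issue_is_specific": True, "tests_are_reliable": True, "is_ambiguous": False},
    {"id": "b", "issue_is_specific": True, "tests_are_reliable": False, "is_ambiguous": False},
]
verified = [t for t in tasks if is_verified(t)]
print([t["id"] for t in verified])  # ['a']
```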
SWE-bench Verified is now the standard subset used for most leaderboard comparisons.
Current state of the leaderboard (late 2025)
On the Bash Only leaderboard — which evaluates all models on SWE-bench Verified using the same shell-based interface — the top models are resolving around 74% of issues:
| Model | % Resolved (Verified) |
|---|---|
| Claude 4.5 Opus (medium) | 74.40% |
| Gemini 3 Pro Preview | 74.20% |
| Claude 4.5 Sonnet | 70.60% |
| Claude 4 Opus (May 2025) | 67.60% |
| GPT-5 (medium reasoning) | 65.00% |
These numbers have been climbing quickly; for context, the best scores on SWE-bench Verified were around 50% in late 2024.
Interpreting the results
It’s tempting to read “74% resolved” as meaning AI can fix 74% of real-world software bugs, but several important caveats apply:
- Curated subset: SWE-bench Verified deliberately filters out ambiguous, under-documented, or hard-to-test issues. Real-world GitHub issues are messier.
- Issue specification quality: In practice, much of the difficulty in software engineering lies in understanding vague requirements, reproducing bugs, and navigating large unfamiliar codebases. SWE-bench tasks are relatively well-scoped.
- Single-repo Python focus: The benchmark currently draws from a set of well-maintained Python libraries (e.g., Django, scikit-learn, sympy). Generalization to other languages, less-documented codebases, or proprietary software is an open question.
- No deployment or integration testing: SWE-bench tests whether a patch passes unit/integration tests, not whether it would be accepted in a real code review or function correctly at scale.
The self-driving car analogy
The trajectory of SWE-bench scores is reminiscent of autonomous driving predictions circa 2015–2017, when rapid progress on structured benchmarks led many companies to predict full autonomy was just a year or two away. A decade later, the long tail of edge cases turned out to be the hardest part.
Similarly, while the pace of improvement on SWE-bench is genuinely impressive, the remaining 25–30% of unresolved issues — and the much larger space of tasks not captured by the benchmark — may prove disproportionately difficult. Benchmarks measure a specific, well-defined slice of capability, and the gap between benchmark performance and reliable, general-purpose software engineering likely remains significant.
Why it matters
Despite these caveats, SWE-bench provides a useful signal for tracking progress in AI-assisted software engineering. It tests end-to-end problem-solving (reading an issue, understanding a codebase, writing a correct fix) rather than narrow code completion, making it one of the more meaningful benchmarks for evaluating practical coding ability.
For researchers and practitioners in ML, SWE-bench offers:
- A rough barometer for how quickly AI coding capabilities are improving.
- A reality check on what “AI can code” actually means today — useful for calibrating expectations when adopting AI tools.
- An evaluation framework that can be adapted for domain-specific benchmarks (e.g., testing AI on bioinformatics pipelines or data analysis workflows).
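The core of such an adaptation is a generic evaluation loop: generate a candidate fix from a task description, then check it against a test oracle. A minimal sketch, with pluggable callables standing in for the model and the sandboxed test harness (all names here are assumptions, not SWE-bench APIs):

```python
from typing import Callable

def evaluate(tasks: list[dict],
             generate_patch: Callable[[dict], str],
             run_tests: Callable[[dict, str], bool]) -> float:
    """Generic resolved-rate evaluation: for each task, generate a patch
    from the task description, then check it against the task's test suite.
    The callables are placeholders for a model and a sandboxed test runner."""
    resolved = 0
    for task in tasks:
        patch = generate_patch(task)
        if run_tests(task, patch):
            resolved += 1
    return 100.0 * resolved / len(tasks) if tasks else 0.0

# Stub usage: a trivial "model" and a checker that accepts any non-empty patch.
tasks = [{"issue": "fix off-by-one"}, {"issue": "handle empty input"}]
score = evaluate(tasks, lambda t: "diff --git ...", lambda t, p: bool(p))
print(f"{score:.0f}% resolved")  # 100% resolved
```

Swapping in domain-specific tasks and a real test harness (e.g., pipeline runs checked against expected outputs) is what turns this skeleton into a new benchmark.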
Questions
If you have any lingering questions about this resource, feel free to post to the Nexus Q&A on GitHub. We will improve materials on this website as additional questions come in.