Article Issue #5190

HumanEval

What to know

HumanEval is a benchmark dataset created by OpenAI to evaluate the functional correctness of code generated by language models.
The benchmark supplies the function signature and docstring as the prompt.
HumanEval is a useful signal when comparing models for code generation features, but its 164 simple problems represent a narrow slice of real-world engineering tasks.

Wikiwalls Team Administrator

May 15, 2026 2 min read

HumanEval is a benchmark dataset created by OpenAI to evaluate the functional correctness of code generated by language models. Each problem provides a function signature, a docstring, and a set of unit tests that are withheld from the model. A model’s score is the percentage of problems for which its generated code passes all unit tests; when a single sample is generated per problem, this metric is called pass@1.

How it works

The benchmark supplies the function signature and docstring as the prompt, and the model generates the function body. The completed function is then executed against the problem's unit tests, which the model never sees, and it passes only if every assertion succeeds. Pass@k metrics allow k candidate solutions to be generated per problem; the problem counts as solved if at least one of the k solutions passes.
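The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the official harness: the toy problem, the `passes` and `solved_with_k` helpers, and the convention of a `check(candidate)` test function called on a named entry point are assumptions made for the example.

```python
# Toy problem in the HumanEval style. The problem and all names here are
# hypothetical illustrations, not entries taken from the dataset.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Unit tests that the model never sees; assumed to define check(candidate).
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

ENTRY_POINT = "add"


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in test."""
    program = prompt + completion + "\n\n" + test + f"\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        # WARNING: exec runs model-written code directly. A real harness
        # would sandbox the process and enforce a timeout.
        exec(program, namespace)
    except Exception:
        # Any failed assertion, syntax error, or runtime error is a fail.
        return False
    return True


def solved_with_k(completions: list[str]) -> bool:
    """A problem counts as solved if any of the k candidates passes."""
    return any(passes(PROMPT, c, TEST, ENTRY_POINT) for c in completions)


# Two candidate function bodies: the first is wrong, the second is correct.
candidates = ["    return a - b\n", "    return a + b\n"]
print(passes(PROMPT, candidates[0], TEST, ENTRY_POINT))  # first candidate alone
print(solved_with_k(candidates))                         # any of k = 2
```

Running generated code with `exec` in the same process is only acceptable for a toy example; anything that evaluates real model output should isolate execution.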

Key facts

Size: 164 handwritten Python programming problems of varying difficulty.
Metric: pass@1 (single-sample pass rate) and pass@k are the primary reported metrics.
Scores: GPT-4o is reported at around 90 percent pass@1; newer frontier models regularly exceed this.
Limitations: Problems are relatively straightforward algorithmic tasks; they do not capture repository-level or multi-file coding ability.
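In practice, pass@k is usually not measured by drawing exactly k samples. A common approach is to draw n samples per problem (with n at least k), count the c samples that pass, and estimate pass@k as 1 - C(n - c, k) / C(n, k), averaged over all problems. The sketch below assumes that estimator; the function names `pass_at_k` and `benchmark_score` are illustrative, not part of any library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Computes the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates across the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for three problems, 10 samples each.
counts = [0, 5, 10]
print(benchmark_score(counts, n=10, k=1))
print(benchmark_score(counts, n=10, k=5))
```

With k = 1 the estimate for a problem reduces to c / n, the fraction of samples that pass, which is why pass@1 reads as a single-sample pass rate.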

For builders

HumanEval is a useful signal when comparing models for code generation features, but its 164 simple problems represent a narrow slice of real-world engineering tasks. Builders selecting models for coding assistants should supplement HumanEval scores with SWE-bench results and with internal evals built from code that reflects their own codebase and task complexity.
