
HumanEval

What to know

  • HumanEval is a benchmark dataset created by OpenAI to evaluate the functional correctness of code generated by language models.
  • The benchmark supplies the function signature and docstring as the prompt.
  • HumanEval is a useful signal when comparing models for code generation features, but its 164 simple problems represent a narrow slice of real-world engineering tasks.


HumanEval is a benchmark dataset created by OpenAI to evaluate the functional correctness of code generated by language models. Each problem provides a function signature, a docstring, and a set of unit tests that are withheld from the model’s prompt. The headline metric, pass@1, is the percentage of problems for which a single generated solution passes all of its unit tests.

How it works

The benchmark supplies the function signature and docstring as the prompt, and the model generates the function body. The completed function is then executed against the withheld test suite, and it passes only if every assertion succeeds. Pass@k metrics allow k candidate solutions to be generated per problem; the problem counts as solved if at least one of the k candidates passes.
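The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the official harness: the problem, the `passes` and `solved_with_k` helpers, and the assumption that the test code defines a `check(candidate)` function are all chosen for illustration. It also runs code with a bare `exec`, whereas a real harness would sandbox model output and enforce timeouts.

```python
# Hypothetical HumanEval-style problem (illustrative, not taken from the dataset).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Test code that the model never sees. Assumed here to define check(candidate).
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion satisfies every assertion in test."""
    program = prompt + completion + "\n\n" + test + f"\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Sandbox this in any real setting.
        exec(program, namespace)
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a fail.
        return False
    return True


def solved_with_k(prompt: str, completions: list[str], test: str,
                  entry_point: str) -> bool:
    """A problem counts as solved if any one of the k candidates passes."""
    return any(passes(prompt, c, test, entry_point) for c in completions)


good = "    return a + b\n"   # a correct function body
bad = "    return a - b\n"    # an incorrect function body

print(passes(PROMPT, good, TEST, "add"))
print(passes(PROMPT, bad, TEST, "add"))
print(solved_with_k(PROMPT, [bad, good], TEST, "add"))
```

Note that the model only ever contributes the indented function body; the signature and docstring come from the prompt, and the tests are appended afterwards.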

Key facts

  • Size: 164 handwritten Python programming problems of varying difficulty.
  • Metric: pass@1 (single-sample pass rate) and pass@k are the primary reported metrics.
  • Scores: GPT-4o is reported at around 90 percent pass@1, and newer frontier models regularly exceed this, which leaves little headroom for separating top models.
  • Limitations: Problems are relatively straightforward algorithmic tasks; they do not capture repository-level or multi-file coding ability.
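In practice, pass@k is usually not computed by drawing exactly k samples. A common approach, described in the paper that introduced the benchmark, is to draw n ≥ k samples per problem, count the c samples that pass, and estimate pass@k as 1 − C(n − c, k) / C(n, k). The sketch below assumes that estimator; the function names and the per-problem counts are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts for three problems: (samples drawn, samples that passed).
results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=5))
```

With k = 1 the estimate reduces to c / n, the plain fraction of samples that pass, which is why pass@1 is read as a single-sample pass rate.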

For builders

HumanEval is a useful signal when comparing models for code generation features, but its 164 simple problems represent a narrow slice of real-world engineering tasks. Builders selecting models for coding assistants should supplement HumanEval scores with SWE-bench results and with internal evals built from code that reflects their own codebase and task complexity.
