HumanEval
HumanEval is a benchmark dataset created by OpenAI to evaluate the functional correctness of code generated by language models. Each problem provides a function signature, a docstring, and a set of unit tests that are withheld from the model's prompt. A model's pass@1 score is the percentage of problems for which a single generated solution passes all of that problem's unit tests.
How it works
The benchmark supplies the function signature and docstring as the prompt, and the model generates the function body. The completed function is executed against the problem's unit tests, which the model never sees, and it counts as a pass only if every assertion succeeds. The pass@k metric allows k candidate solutions per problem; the problem counts as solved if any one of the k candidates passes.
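The execution step described above can be sketched in a few lines of Python. This is a minimal illustration, not the official harness: the toy problem (`add`), the helper name `passes`, and the test contents are hypothetical and do not come from the dataset. The sketch assumes the test code defines a `check(candidate)` function that is called on the generated function, and it omits the sandboxing and timeouts that any real evaluation of untrusted model output needs.

```python
# Hypothetical toy problem written in the style of a HumanEval task.
# It is NOT taken from the dataset.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Unit tests that the model never sees. Assumed shape: a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in test."""
    # Assemble one program: the prompt, the generated body, the tests,
    # and a final call that runs the tests against the generated function.
    program = f"{prompt}{completion}\n\n{test}\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. A real harness would isolate this
        # in a sandboxed process with a timeout.
        exec(program, namespace)
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a fail.
        return False
    return True


good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes(PROMPT, good_completion, TEST, "add"))
print(passes(PROMPT, bad_completion, TEST, "add"))
```

The first call reports a pass and the second a fail, because a single failed assertion is enough to reject a candidate.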
Key facts
- Size: 164 handwritten Python programming problems of varying difficulty.
- Metric: pass@1 (single-sample pass rate) and pass@k are the primary reported metrics.
- Scores: GPT-4o achieves around 90 percent pass@1; frontier models now regularly exceed this.
- Limitations: Problems are relatively straightforward algorithmic tasks; they do not capture repository-level or multi-file coding ability.
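The pass@k metric listed above is usually not computed by drawing exactly k samples. Chen et al. (2021) describe an unbiased estimator: generate n samples per problem (with n at least k), count the c samples that pass, and compute 1 - C(n - c, k) / C(n, k). The sketch below implements that formula for a single problem; the function name `pass_at_k` and the example numbers are illustrative assumptions, and a benchmark-level score would be the mean of this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of candidates the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 3 of which passed.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimator reduces to c / n, the fraction of samples that passed, which is why pass@1 is often described as the single-sample pass rate.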
For builders
HumanEval is a useful signal when comparing models for code generation features, but its 164 short, self-contained problems represent a narrow slice of real-world engineering tasks. Builders selecting models for coding assistants should supplement HumanEval scores with SWE-bench results and with internal evals on code representative of their own codebase and task complexity.
Sources
- Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv:2107.03374. arxiv.org
- Jimenez, C., et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770. arxiv.org
- Hendrycks, D., et al. (2020). Measuring Massive Multitask Language Understanding (MMLU). arXiv:2009.03300. arxiv.org
- Stanford CRFM. Holistic Evaluation of Language Models (HELM). crfm.stanford.edu
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. arxiv.org