MMLU
MMLU (Massive Multitask Language Understanding) is a benchmark introduced by Hendrycks et al. in 2020 that evaluates language model knowledge and reasoning across 57 subjects, ranging in difficulty from elementary to professional level. Each question is a multiple-choice item with four answer options, drawn from practice exams, textbooks, and academic assessments.
How it works
The model receives a question and four answer choices labeled A through D and must select the correct option. The overall MMLU score is the accuracy over the roughly 14,000 test questions spanning all 57 subjects. Subject-level breakdowns reveal domain-specific strengths and weaknesses. The standard evaluation protocol is five-shot prompting: five worked examples from the same subject precede each test question.
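The protocol above can be sketched in a few lines of Python. This is a minimal illustration, not an official harness: the `Item` class, the function names, and the exact header wording are assumptions chosen for the example, loosely modeled on the prompt layout commonly used for MMLU.

```python
from dataclasses import dataclass

CHOICE_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical container for this sketch)."""
    question: str
    choices: tuple[str, str, str, str]
    answer: str  # gold label, one of "A".."D"


def format_item(item: Item, include_answer: bool) -> str:
    """Render a question with lettered choices, optionally followed by its answer."""
    lines = [item.question]
    for label, choice in zip(CHOICE_LABELS, item.choices):
        lines.append(f"{label}. {choice}")
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject: str, shots: list[Item], target: Item) -> str:
    """Few-shot prompt: solved examples first, then the unanswered target question."""
    # Header wording is an assumption; adjust it to match the harness you compare against.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_item(shot, include_answer=True) for shot in shots]
    blocks.append(format_item(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)


def accuracy(predictions: list[str], items: list[Item]) -> float:
    """Fraction of predicted labels that match the gold labels."""
    if len(predictions) != len(items):
        raise ValueError("predictions and items must have the same length")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item.answer
        for pred, item in zip(predictions, items)
    )
    return correct / len(items)


if __name__ == "__main__":
    shot = Item("What is 1 + 1?", ("1", "2", "3", "4"), "B")
    target = Item("What is 2 + 2?", ("3", "4", "5", "6"), "B")
    print(build_prompt("elementary mathematics", [shot], target))
```

In a real run, `shots` would hold the five examples for the subject, and the model's predicted letter for each test question would be collected into `predictions` before calling `accuracy`.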
Key facts
- Subjects: Covers STEM, humanities, social sciences, law, medicine, and professional domains.
- Scores: Estimated human expert accuracy is around 90 percent. GPT-4-class models exceed 85 percent, and frontier models approach 90 percent.
- Saturation concern: Leading models now score at or near human expert level, reducing the benchmark’s discriminative power.
- Contamination: Some test questions have appeared in model training data, potentially inflating reported scores.
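One simple way to probe for the contamination mentioned above is to measure word n-gram overlap between a test question and a sample of training text. The sketch below is a heuristic under stated assumptions, not the method of any particular paper: the function names and the default window of 8 words are illustrative choices, and whitespace tokenization is deliberately crude.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus text.

    Returns 0.0 when the question is shorter than n words, since it has no n-grams.
    """
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams = word_ngrams(corpus_text, n)
    return len(question_grams & corpus_grams) / len(question_grams)


if __name__ == "__main__":
    question = "Which of the following best describes the role of mitochondria in a cell?"
    corpus = "Quiz 3. Which of the following best describes the role of mitochondria in a cell? A. ..."
    print(overlap_fraction(question, corpus))
```

A high fraction suggests the question text appears nearly verbatim in the sampled corpus. A low fraction does not rule out contamination, because paraphrased or reformatted copies would not match.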
For builders
MMLU scores provide a broad signal for knowledge coverage but should not be the sole criterion for model selection in production applications. For domain-specific deployments such as legal, medical, or financial tools, builders should run targeted in-domain evals that reflect the actual question types and difficulty levels users will encounter rather than relying on aggregate MMLU percentages.
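A targeted in-domain eval can be as small as a loop over your own labeled cases, reported per category rather than as one aggregate number. The following is a minimal sketch: `accuracy_by_group`, the `(group, prompt, expected)` case layout, and the `model` callable are all hypothetical names for illustration, and `model` stands in for whatever function sends a prompt to your system and returns its answer label.

```python
from collections import defaultdict
from typing import Callable, Iterable


def accuracy_by_group(
    cases: Iterable[tuple[str, str, str]],
    model: Callable[[str], str],
) -> dict[str, float]:
    """Score a model on labeled cases and report accuracy per group.

    Each case is (group, prompt, expected_label). `model` is any callable
    that maps a prompt string to a predicted label string.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for group, prompt, expected in cases:
        totals[group] += 1
        # Compare labels case-insensitively and ignore surrounding whitespace.
        if model(prompt).strip().upper() == expected.strip().upper():
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Stub model that always answers "A"; replace with a real model call.
    def always_a(prompt: str) -> str:
        return "A"

    cases = [
        ("contracts", "Question about contract formation ...", "A"),
        ("contracts", "Question about breach remedies ...", "B"),
        ("tax", "Question about deductions ...", "A"),
    ]
    print(accuracy_by_group(cases, always_a))
```

Grouping by the categories that matter in the product (for example, document type or difficulty tier) surfaces weaknesses that a single aggregate percentage would hide.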
Sources
- Hendrycks, D., et al. (2020). Measuring Massive Multitask Language Understanding (MMLU). arXiv:2009.03300. arxiv.org
- Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv:2107.03374. arxiv.org
- Jimenez, C., et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770. arxiv.org
- Stanford CRFM. Holistic Evaluation of Language Models (HELM). crfm.stanford.edu
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. arxiv.org