
MMLU

What to know

MMLU (Massive Multitask Language Understanding) is a benchmark introduced by Hendrycks et al. in 2020.
The model receives a question and four answer choices labeled A through D and must select the correct option.
MMLU scores provide a broad signal for knowledge coverage but should not be the sole criterion for model selection in production applications.

Wikiwalls Team Administrator

May 15, 2026 2 min read


MMLU (Massive Multitask Language Understanding) is a benchmark introduced by Hendrycks et al. in 2020 that evaluates language model knowledge and reasoning across 57 subjects, ranging in difficulty from elementary to professional level. Each question is a four-option multiple-choice item drawn from practice exams, textbooks, and academic assessments.

How it works

The model receives a question and four answer choices labeled A through D and must select the correct option. The overall MMLU score is the model's accuracy on the test set, which contains roughly 14,000 questions spread across the 57 subjects. Subject-level breakdowns reveal domain-specific strengths and weaknesses. The standard evaluation protocol is 5-shot prompting: each test question is preceded by five solved examples from the same subject.
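The protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names, the header wording, and the `(subject, predicted, gold)` record layout are assumptions made for this example, and the overall score here is computed over all questions pooled together, whereas some evaluation harnesses instead report the unweighted mean of the per-subject accuracies.

```python
from collections import defaultdict

CHOICE_LABELS = ["A", "B", "C", "D"]


def format_item(question, choices, answer=None):
    """Render one question; the answer letter is filled in only for solved examples."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, question, choices):
    """Build a few-shot prompt: header, solved examples, then the open question.

    shots: list of (question, choices, answer_letter) tuples, five in the
    standard protocol. The header wording is an assumption for illustration.
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    solved = [format_item(q, c, a) for q, c, a in shots]
    return "\n\n".join([header, *solved, format_item(question, choices)])


def score(records):
    """Compute accuracy from (subject, predicted_letter, gold_letter) records.

    Returns (overall_accuracy, {subject: accuracy}); overall accuracy pools
    all questions, so larger subjects carry more weight.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject
```

The prompt ends with a bare "Answer:" so that the model's next token can be read as its chosen letter; how that letter is extracted (greedy decoding, comparing the probabilities of A through D, and so on) varies between harnesses and is left out of this sketch.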

Key facts

Subjects: Covers STEM, humanities, social sciences, law, medicine, and professional domains.
Scores: Random guessing yields 25 percent. Estimated human expert accuracy is around 90 percent. GPT-4 class models exceed 85 percent, and frontier models approach 90 percent.
Saturation concern: Leading models now score at or near human expert level, reducing the benchmark’s discriminative power.
Contamination: Some test questions have appeared in model training data, potentially inflating reported scores.
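One common heuristic for spotting contamination is to look for long word sequences that a test question shares with training text. The sketch below is a simplified illustration of that idea, not a method prescribed by the benchmark: the function names, the 8-word window, and the 0.5 threshold are all assumptions chosen for the example, and a real check would run against a far larger corpus with an index rather than an in-memory set.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_contaminated(questions, corpus_text, n=8, threshold=0.5):
    """Return the questions whose n-gram overlap with corpus_text meets the threshold.

    Overlap is the fraction of a question's n-grams that also occur in the
    corpus. Questions shorter than n words have no n-grams and are never flagged.
    """
    corpus_grams = ngrams(corpus_text, n)
    flagged = []
    for question in questions:
        question_grams = ngrams(question, n)
        if not question_grams:
            continue
        overlap = len(question_grams & corpus_grams) / len(question_grams)
        if overlap >= threshold:
            flagged.append(question)
    return flagged
```

Exact n-gram matching only catches verbatim or near-verbatim copies; paraphrased or translated questions would pass this check unnoticed.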

For builders

MMLU scores provide a broad signal for knowledge coverage but should not be the sole criterion for model selection in production applications. For domain-specific deployments such as legal, medical, or financial tools, builders should run targeted in-domain evals that reflect the actual question types and difficulty levels users will encounter rather than relying on aggregate MMLU percentages.
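A targeted in-domain eval can start very small. The sketch below assumes a hypothetical `answer_fn` callable that wraps whatever model is under test and returns an answer letter, and an item layout (`type`, `question`, `choices`, `answer`) invented for this example; neither comes from MMLU or any particular library. It reports accuracy per question type so that a weak category is not hidden by a strong aggregate.

```python
from collections import defaultdict


def run_domain_eval(answer_fn, items):
    """Score answer_fn on in-domain items, grouped by question type.

    answer_fn: hypothetical callable (question, choices) -> answer letter.
    items: iterable of dicts with keys "type", "question", "choices", "answer".
    Returns {question_type: accuracy}.
    """
    outcomes = defaultdict(list)
    for item in items:
        predicted = answer_fn(item["question"], item["choices"])
        outcomes[item["type"]].append(predicted == item["answer"])
    return {qtype: sum(hits) / len(hits) for qtype, hits in outcomes.items()}
```

In practice the items would be written or sampled from real user traffic, tagged with the question types and difficulty levels the deployment actually sees, and the same set would be run against each candidate model for a like-for-like comparison.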