Article Issue #5192

MMLU

What to know

  • MMLU (Massive Multitask Language Understanding) is a benchmark introduced by Hendrycks et al. in 2020.
  • The model receives a question and four answer choices labeled A through D and must select the correct option.
  • MMLU scores provide a broad signal for knowledge coverage but should not be the sole criterion for model selection in production applications.

MMLU (Massive Multitask Language Understanding) is a benchmark introduced by Hendrycks et al. in 2020 that evaluates language model knowledge and reasoning across 57 diverse subjects, ranging in difficulty from elementary to professional level. Each question is a four-option multiple-choice item drawn from practice exams, textbooks, and academic assessments.

How it works

The model receives a question and four answer choices labeled A through D and must select the correct option. Accuracy over the roughly 14,000 test questions (14,042 in the released test split) across all 57 subjects produces an overall MMLU score. Subject-level breakdowns reveal domain-specific strengths and weaknesses. The standard evaluation protocol is 5-shot prompting: five solved examples from the same subject precede each test question.
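The protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the prompt template is modeled on a commonly used layout, and the function names `format_example`, `build_prompt`, and `score` are assumptions made for this example.

```python
from collections import defaultdict

CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one item; include the answer letter for few-shot demonstrations."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, question, options):
    """Prepend up to five solved examples from the same subject to the test question.

    shots: list of (question, options, answer_letter) tuples (hypothetical layout).
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    demos = [format_example(q, o, a) for q, o, a in shots[:5]]
    return "\n\n".join([header, *demos, format_example(question, options)])


def score(records):
    """records: iterable of (subject, predicted_letter, gold_letter).

    Returns overall accuracy over all questions and a per-subject breakdown.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject
```

Note that `score` pools all questions into one accuracy figure. Averaging the per-subject accuracies instead gives a different number when subjects have unequal question counts, so it is worth checking which convention a reported score uses.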

Key facts

  • Subjects: Covers STEM, humanities, social sciences, law, medicine, and professional domains.
  • Scores: Estimated human expert accuracy is around 90 percent (the original paper puts it at 89.8 percent). GPT-4 class models exceed 85 percent; frontier models approach 90 percent.
  • Saturation concern: Leading models now score at or near human expert level, reducing the benchmark’s discriminative power.
  • Contamination: Some test questions have appeared in model training data, potentially inflating reported scores.
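The contamination concern in the last bullet can be probed with a simple word n-gram overlap check between a test question and a sample of training text. The sketch below is an assumption-laden illustration rather than an established MMLU procedure: the helper names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default window of 8 words are all arbitrary choices for this example.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also occur in corpus_text.

    Returns 0.0 when the question has fewer than n words.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    return len(question_grams & ngrams(corpus_text, n)) / len(question_grams)
```

A high ratio suggests the question text may have been seen verbatim. A low ratio does not rule out contamination, because paraphrased or reformatted copies would not match.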

For builders

MMLU scores provide a broad signal for knowledge coverage but should not be the sole criterion for model selection in production applications. For domain-specific deployments such as legal, medical, or financial tools, builders should run targeted in-domain evals that reflect the actual question types and difficulty levels users will encounter rather than relying on aggregate MMLU percentages.
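A targeted in-domain eval of the kind described above can start very small. The following sketch assumes a hypothetical item layout (dicts with `prompt`, `expected`, and `tag` keys) and a hypothetical `ask_model` callable standing in for whatever model client is in use. The exact-match comparison is a deliberate simplification.

```python
from collections import Counter


def run_domain_eval(items, ask_model):
    """Score a model on in-domain items and report accuracy per tag.

    items: list of dicts with 'prompt', 'expected', and 'tag' keys (assumed layout).
    ask_model: any callable mapping a prompt string to an answer string.
    """
    hits, seen = Counter(), Counter()
    for item in items:
        answer = ask_model(item["prompt"]).strip().lower()
        seen[item["tag"]] += 1
        hits[item["tag"]] += int(answer == item["expected"].strip().lower())
    return {tag: hits[tag] / seen[tag] for tag in seen}
```

Tagging items by question type or difficulty level makes weak areas visible in a way that a single aggregate percentage does not. Free-text answers would need a more forgiving comparison than exact match.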

Administrator · 41 published guides · Joined 2016
