Article Issue #5188

LLM-as-Judge

What to know

  • LLM-as-Judge is an evaluation pattern in which a powerful language model, acting as an evaluator, is prompted to score or compare candidate outputs against reference criteria or a ground-truth answer.
  • The judge model receives a rubric, the original prompt, and one or more candidate responses.
  • LLM-as-Judge enables automated regression testing for prompt changes, model upgrades, and feature launches without blocking on human evaluation cycles.



LLM-as-Judge is an evaluation pattern in which a powerful language model, acting as an evaluator, is prompted to score or compare candidate outputs against reference criteria or a ground-truth answer. It scales AI evaluation to thousands of samples without requiring human raters for every judgment, making it practical for continuous regression testing and large-scale eval runs.

How it works

The judge model receives a rubric, the original prompt, and one or more candidate responses. It returns a score, a preference (which response is better), or a pass/fail verdict along with an explanation. Pairwise comparison, asking which of two responses is better, typically yields more reliable signals than absolute scoring. Calibration against human judgments validates that the judge model’s decisions correlate with human preferences.
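
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the prompt wording, the `VERDICT:` line format, and the `call_judge` callable (a stand-in for whatever model client you use) are all hypothetical and do not come from any particular library.

```python
import re
from typing import Callable

# Hypothetical prompt layout: rubric, original prompt, then both candidates.
PAIRWISE_TEMPLATE = """You are an impartial judge.

Rubric:
{rubric}

User prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Explain your reasoning briefly, then finish with exactly one line:
VERDICT: A, VERDICT: B, or VERDICT: TIE"""


def build_pairwise_prompt(rubric: str, prompt: str,
                          response_a: str, response_b: str) -> str:
    """Fill the template with the rubric, the prompt, and both candidates."""
    return PAIRWISE_TEMPLATE.format(
        rubric=rubric, prompt=prompt,
        response_a=response_a, response_b=response_b,
    )


def parse_verdict(judge_output: str) -> str:
    """Return 'A', 'B', or 'TIE'; raise ValueError if no verdict line exists."""
    matches = re.findall(r"^VERDICT:\s*(A|B|TIE)\s*$", judge_output,
                         flags=re.MULTILINE)
    if not matches:
        raise ValueError("judge output contained no VERDICT line")
    return matches[-1]  # use the last verdict line if several appear


def judge_pair(call_judge: Callable[[str], str], rubric: str, prompt: str,
               response_a: str, response_b: str) -> dict:
    """Run one pairwise comparison and keep the explanation with the verdict."""
    raw = call_judge(build_pairwise_prompt(rubric, prompt,
                                           response_a, response_b))
    return {"verdict": parse_verdict(raw), "explanation": raw}
```

Passing the model call in as a plain function keeps the sketch independent of any vendor SDK and makes it easy to substitute a canned response in tests.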

Key facts

  • Popularized by: The MT-Bench and Chatbot Arena study (LMSYS, 2023), which reported that GPT-4 judgments agreed with human preferences more than 80% of the time, about the same rate at which human raters agreed with one another.
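  • Bias risks: Judge models exhibit position bias (favoring a response because of where it appears, often the first slot), verbosity bias (favoring longer responses), and self-enhancement bias (favoring outputs produced by the judge's own model).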
  • Best practice: Use a model at least as capable as the model being judged; using a weaker judge produces unreliable results.
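  • Cost: Each judge call re-reads the rubric, the original prompt, and the candidate responses, then writes an explanation, so running a judge can roughly double or triple eval token costs compared to generation alone.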
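
One way to see position bias in practice is to judge each pair twice with the order swapped and count how often the verdict changes. The sketch below assumes a hypothetical judge function that takes the prompt plus two responses and answers "first", "second", or "tie"; that signature is an illustration, not part of any real API.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical signature: (prompt, first_response, second_response)
# -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, prompt: str,
                      resp_x: str, resp_y: str) -> str:
    """Return 'x', 'y', or 'tie' if both orderings agree, else 'inconsistent'."""
    forward = judge(prompt, resp_x, resp_y)   # x is shown first
    backward = judge(prompt, resp_y, resp_x)  # y is shown first

    # Map slot-based answers back to the underlying responses.
    forward_label = {"first": "x", "second": "y", "tie": "tie"}[forward]
    backward_label = {"first": "y", "second": "x", "tie": "tie"}[backward]

    if forward_label == backward_label:
        return forward_label
    return "inconsistent"


def inconsistency_rate(judge: Judge,
                       cases: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, resp_x, resp_y) cases whose verdict flips on swap."""
    results = [judge_both_orders(judge, p, x, y) for p, x, y in cases]
    if not results:
        raise ValueError("need at least one case")
    return results.count("inconsistent") / len(results)
```

A judge that always picks whatever it sees first would score 1.0 here, while a judge that responds only to content would score 0.0. Treating inconsistent pairs as ties or sending them to human review is a design choice left to the team.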

For builders

LLM-as-Judge enables automated regression testing for prompt changes, model upgrades, and feature launches without blocking on human evaluation cycles. Builders should calibrate judge prompts against a human-annotated gold set to verify alignment before using judge scores as primary quality gates. Logging judge reasoning alongside scores provides interpretable signal for debugging model regressions.
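
Calibration against a gold set can be as simple as comparing judge labels with human labels on the same items. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) in plain Python; the label values and function names are illustrative assumptions, and any threshold for "good enough" is a project decision rather than something this article specifies.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(human)
    observed = agreement_rate(human, judge)

    # Expected agreement if both raters labeled independently at their
    # own observed label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, with human labels `["pass", "pass", "fail", "fail"]` and judge labels `["pass", "pass", "fail", "pass"]`, raw agreement is 0.75 and kappa is 0.5, which shows how chance correction can temper an agreement figure that looks strong on its own.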

Administrator · 41 published guides · Joined 2016
