Article Issue #5188

LLM-as-Judge

What to know

LLM-as-Judge is an evaluation pattern in which a powerful language model, acting as an evaluator, is prompted to score or compare candidate outputs against reference criteria or a ground-truth answer; The judge model receives a rubric, the original prompt, and one or more candidate responses; LLM-as-Judge enables automated regression testing for prompt changes, model upgrades, and feature launches without blocking on human evaluation cycles

Wikiwalls Team Administrator

May 15, 2026 2 min read

« Back to Glossary Index

LLM-as-Judge is an evaluation pattern in which a powerful language model, acting as an evaluator, is prompted to score or compare candidate outputs against reference criteria or a ground-truth answer. It scales AI evaluation to thousands of samples without requiring human raters for every judgment, making it practical for continuous regression testing and large-scale eval runs.

How it works

The judge model receives a rubric, the original prompt, and one or more candidate responses. It returns a score, a preference (which response is better), or a pass/fail verdict along with an explanation. Pairwise comparison, asking which of two responses is better, typically yields more reliable signals than absolute scoring. Calibration against human judgments validates that the judge model’s decisions correlate with human preferences.

Key facts

Popularized by: MT-Bench and Chatbot Arena (LMSYS, 2023) demonstrated that GPT-4 judgments correlated strongly with human preferences.
Bias risks: Judge models exhibit position bias (favoring the first response), verbosity bias, and self-enhancement bias.
Best practice: Use a model at least as capable as the model being judged; using a weaker judge produces unreliable results.
Cost: Running a judge model doubles or triples eval token costs compared to generation alone.

For builders

LLM-as-Judge enables automated regression testing for prompt changes, model upgrades, and feature launches without blocking on human evaluation cycles. Builders should calibrate judge prompts against a human-annotated gold set to verify alignment before using judge scores as primary quality gates. Logging judge reasoning alongside scores provides interpretable signal for debugging model regressions.