LLM-as-Judge
LLM-as-Judge is an evaluation pattern in which a capable language model is prompted to score or compare candidate outputs against reference criteria or a ground-truth answer. It scales evaluation to thousands of samples without requiring a human rater for every judgment, which makes it practical for continuous regression testing and large-scale eval runs.
How it works
The judge model receives a rubric, the original prompt, and one or more candidate responses. It returns a score, a preference (which response is better), or a pass/fail verdict, along with an explanation. Pairwise comparison, which asks the judge which of two responses is better, typically yields a more reliable signal than absolute scoring. Calibrating the judge against human judgments checks that its decisions actually correlate with human preferences.
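The inputs and outputs described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard format: the template wording, the `VERDICT:` line convention, and the function names `build_pairwise_prompt` and `parse_verdict` are all assumptions made for this example. The call to the judge model itself is left out, since it depends on the provider.

```python
import re
from typing import Optional

# Hypothetical prompt template; the wording is an assumption, not a standard.
JUDGE_TEMPLATE = """You are an impartial evaluator.

Rubric:
{rubric}

Original prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Explain your reasoning briefly, then finish with one line of the form
VERDICT: A, VERDICT: B, or VERDICT: TIE."""


def build_pairwise_prompt(rubric: str, prompt: str, response_a: str, response_b: str) -> str:
    """Fill the template with the rubric, the original prompt, and both candidates."""
    return JUDGE_TEMPLATE.format(
        rubric=rubric, prompt=prompt, response_a=response_a, response_b=response_b
    )


# Matches a whole line such as "VERDICT: A" (case-insensitive).
_VERDICT_RE = re.compile(r"^\s*VERDICT:\s*(A|B|TIE)\s*$", re.IGNORECASE | re.MULTILINE)


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A', 'B', or 'TIE' from the last verdict line, or None if absent."""
    matches = _VERDICT_RE.findall(judge_output)
    return matches[-1].upper() if matches else None
```

Returning `None` for unparseable output, rather than guessing, lets the eval harness count and inspect malformed judge responses separately from real verdicts.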
Key facts
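- Popularized by: The MT-Bench and Chatbot Arena study (LMSYS, 2023), which reported that GPT-4 judgments agreed with human preferences more than 80% of the time, roughly the level of agreement between human raters.
- Bias risks: Judge models exhibit position bias (favoring a response because of where it appears, often the first), verbosity bias (favoring longer responses), and self-enhancement bias (favoring their own outputs).
- Best practice: Use a judge at least as capable as the model being judged; a weaker judge tends to produce unreliable results.
- Cost: Every judged sample adds a judge-model call, which can double or triple eval token costs compared with generation alone, depending on rubric length and how many candidates each call includes.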
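One common mitigation for the position bias noted above is to run each comparison twice with the order swapped and count a win only when both runs agree; the MT-Bench paper describes a conservative variant of this idea. The sketch below assumes a hypothetical `Judge` callable that returns "first", "second", or "tie"; the two toy judges stand in for real model calls and exist only to show the behavior.

```python
from typing import Callable

# A judge is any callable that takes (prompt, first, second) and returns
# "first", "second", or "tie". In practice it would wrap a model call;
# that wrapper is assumed and not shown here.
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A', 'B', or 'tie', counting a win only if it survives an order swap."""
    forward = judge(prompt, response_a, response_b)   # A shown first
    backward = judge(prompt, response_b, response_a)  # B shown first

    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # inconsistent or tied verdicts are treated as a tie


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with extreme position bias, for illustration only."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge that ignores position and picks the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With `always_first`, every comparison collapses to a tie, which is the intended outcome: a verdict driven purely by position carries no signal. The swap also doubles the number of judge calls per pair, which feeds directly into the cost consideration above.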
For builders
LLM-as-Judge enables automated regression testing for prompt changes, model upgrades, and feature launches without blocking on human evaluation cycles. Builders should calibrate judge prompts against a human-annotated gold set to verify alignment before using judge scores as primary quality gates. Logging judge reasoning alongside scores provides interpretable signal for debugging model regressions.
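Calibration against a gold set can be quantified with simple agreement statistics. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between judge labels and human labels; the function names and the pass/fail labels are illustrative assumptions, and any threshold for "good enough" agreement is a judgment call for the team.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    p_o = agreement_rate(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Expected chance agreement from each rater's label frequencies.
    p_e = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for illustration; a real gold set would be much larger.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(cohens_kappa(judge_labels, human_labels))    # 0.5
```

Kappa is worth reporting alongside raw agreement because a judge that always answers "pass" can look accurate on a gold set that is mostly passes while adding no information.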
Sources
- Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv:2107.03374.
- Jimenez, C., et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770.
- Hendrycks, D., et al. (2020). Measuring Massive Multitask Language Understanding (MMLU). arXiv:2009.03300.
- Stanford CRFM. Holistic Evaluation of Language Models (HELM). crfm.stanford.edu
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.