Article Issue #5485

Constitutional AI

What to know

Constitutional AI is a training methodology developed by Anthropic in which a language model is taught to critique and revise its own outputs against a written list of principles. This reduces the need for large volumes of human feedback on harmful content and gives the model an explicit, inspectable set of values to reason from during training. The approach scales alignment work without scaling the human labeling burden at the same rate, and its design principles have influenced how researchers across the AI safety field think about value specification.


Constitutional AI (CAI) is a training methodology developed by Anthropic in which a language model is taught to evaluate and revise its own outputs according to a written set of principles, called a constitution. Rather than relying solely on human labelers to flag harmful responses, the model performs self-critique and self-revision steps during training, learning to align its behavior with the specified values. The approach was introduced in Bai et al. (2022) and has since influenced alignment research more broadly.

How it works

Constitutional AI operates in two phases. In the supervised learning phase, the model generates a response to a prompt, critiques that response against a principle from the constitution (“Does this response comply with principle X?”), and then rewrites it in light of the critique. The revised outputs form a new supervised fine-tuning dataset. In the reinforcement learning phase, AI feedback replaces or supplements human feedback: a feedback model compares candidate responses against the constitution, the resulting preference labels are used to train a preference model, and that preference model supplies the reward signal for RLHF-style training. This phase is often called RLAIF (Reinforcement Learning from AI Feedback).
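The supervised phase can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Generate` callable stands in for whatever model call a real pipeline would use, `RevisionRecord` and `critique_and_revise` are hypothetical names, and the prompt wording is invented. Principles are cycled in order to keep the example deterministic, whereas a real pipeline might sample them at random.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a text prompt to a completion.
Generate = Callable[[str], str]


@dataclass
class RevisionRecord:
    prompt: str
    initial: str          # first draft, before any critique
    final: str            # response after the last revision
    critiques: List[str]  # one critique per round


def critique_and_revise(
    generate: Generate,
    prompt: str,
    principles: List[str],
    rounds: int = 1,
) -> RevisionRecord:
    """Run `rounds` critique-revision cycles on one prompt."""
    response = generate(prompt)
    initial = response
    critiques: List[str] = []
    for i in range(rounds):
        # Assumption: cycle through principles in order for determinism.
        principle = principles[i % len(principles)]
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        critiques.append(critique)
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it addresses the critique."
        )
    return RevisionRecord(prompt, initial, response, critiques)


def stub_generate(text: str) -> str:
    """Toy stand-in for a language model, only to make the sketch runnable."""
    if text.startswith("Prompt:") and "Rewrite the response" in text:
        return "revised response"
    if text.startswith("Prompt:") and "Critique the response" in text:
        return "critique text"
    return "draft response"


record = critique_and_revise(
    stub_generate, "hello", ["Be harmless.", "Be honest."], rounds=2
)
# The (prompt, final) pair is what would go into the fine-tuning dataset.
sft_example = (record.prompt, record.final)
```

In this sketch only the prompt and the final revision are kept for fine-tuning; the intermediate critiques are retained on the record purely for inspection.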

Key concepts

  • Constitution: the written list of principles the model must follow. The constitution Anthropic published for Claude drew from the Universal Declaration of Human Rights, Apple’s terms of service, and DeepMind’s Sparrow principles, among other sources.
  • Self-critique and revision: the model is prompted to identify whether its own output violates a specific principle and to produce a revised version that does not. Multiple critique-revision cycles can be chained.
  • RLAIF: using AI-generated preference labels (from a model scoring responses against the constitution) instead of or in addition to human preference labels, reducing the human labeling burden for sensitive content.
  • Scalable oversight: Constitutional AI is one instantiation of scalable oversight techniques, which aim to maintain alignment quality as models become more capable than human evaluators in specific domains.
  • Red teaming: adversarial prompting used to surface failure modes before and after CAI training; the critique-revision loop is evaluated against red-team examples during development.
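The RLAIF labeling step from the list above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published pipeline: the `Judge` callable is a hypothetical stand-in for a model that reads two responses and a principle and answers "A" or "B", and the `prompt`/`chosen`/`rejected` keys are illustrative names for a preference record.

```python
from typing import Callable, Dict, List

# Hypothetical judge interface:
# (prompt, response_a, response_b, principle) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def label_preferences(
    judge: Judge,
    prompt: str,
    candidates: List[str],
    principle: str,
) -> List[Dict[str, str]]:
    """Compare every pair of candidates and record which one the judge prefers."""
    pairs: List[Dict[str, str]] = []
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            a, b = candidates[i], candidates[j]
            choice = judge(prompt, a, b, principle)
            if choice not in ("A", "B"):
                continue  # skip answers that cannot be parsed as a preference
            chosen, rejected = (a, b) if choice == "A" else (b, a)
            pairs.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            )
    return pairs


def toy_judge(prompt: str, a: str, b: str, principle: str) -> str:
    """Toy stand-in for a feedback model: avoids responses marked 'UNSAFE'."""
    return "B" if "UNSAFE" in a and "UNSAFE" not in b else "A"


labels = label_preferences(
    toy_judge,
    "How do I pick a lock?",
    ["UNSAFE answer", "safe answer"],
    "Choose the response that is less harmful.",
)
```

Records of this shape are the kind of data a preference model could be trained on; human-labeled pairs can be mixed into the same list when AI feedback supplements rather than replaces human feedback.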

Why it matters for builders

Constitutional AI is not just an Anthropic-internal method; it is a published framework that any organization training language models can apply. The key insight is that a written specification of values is cheaper to produce and iterate on than a large human-labeled preference dataset for harmful content, and it makes the model’s guiding values inspectable rather than implicit. For practitioners evaluating models, CAI-trained models tend to exhibit consistent refusal patterns that map to explicit policy text, which makes behavior more predictable in production than models whose safety behavior was purely empirically tuned. Claude models are trained with CAI as part of their alignment process.

Sources

  • Bai, Y., Jones, A., Ndousse, K., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. arxiv.org
  • Askell, A., Bai, Y., Chen, A., et al. (2021). A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861. arxiv.org
  • Ouyang, L., Wu, J., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 35. neurips.cc
  • Ganguli, D., et al. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv:2209.07858. arxiv.org
  • Anthropic. (2023). Claude Model Card and Evaluations. anthropic.com
  • Perez, E., et al. (2022). Red Teaming Language Models with Language Models. arXiv:2202.03286. arxiv.org
Administrator · 41 published guides · Joined 2016
