Article Issue #5177

RLHF (Reinforcement Learning from Human Feedback)

What to know

  • RLHF (Reinforcement Learning from Human Feedback) is a multi-stage training procedure that refines a pretrained language model using signals derived from human evaluators comparing pairs of model outputs.
  • It proceeds in three phases: supervised fine-tuning, reward model training, and policy optimization with PPO under a KL divergence penalty.
  • Builders using RLHF-trained models benefit from stronger instruction following and safer defaults, but should still apply system prompts and evals to shape outputs for their specific use case.

RLHF (Reinforcement Learning from Human Feedback) is a multi-stage training procedure that refines a pretrained language model using signals derived from human evaluators comparing pairs of model outputs. The goal is to shift the model’s behavior toward responses that humans rate as more helpful, accurate, and safe, beyond what next-token prediction on text corpora achieves.

How it works

RLHF proceeds in three phases. First, a supervised fine-tuning phase trains the model on high-quality demonstrations. Second, human raters compare pairs of model outputs, and their preferences are used to train a reward model that scores candidate responses. Third, the language model is optimized with proximal policy optimization (PPO) to maximize the reward model’s score, while a KL divergence penalty keeps it from drifting too far from the supervised fine-tuned policy.
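
The second and third phases can be illustrated with two small formulas. The sketch below assumes a pairwise (Bradley-Terry style) loss for the reward model, which is a common choice but not something this entry specifies, and a simple per-sequence KL estimate built from token log-probabilities. The function names and the beta value are hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: a Bradley-Terry style objective. The loss shrinks as the
    reward model scores the human-preferred output above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(
    reward_score: float,
    policy_logprobs: list[float],
    reference_logprobs: list[float],
    beta: float = 0.1,  # illustrative penalty weight, not a recommended value
) -> float:
    """Reward the RL step maximizes: reward model score minus a KL penalty.

    The KL term is estimated from the sampled tokens as the sum of
    log pi(token) - log pi_ref(token), where pi_ref is the fine-tuned model.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate


if __name__ == "__main__":
    # Reward model agrees with the rater: small loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Policy assigns higher probability than the reference: reward is reduced.
    print(kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0]))
```

A real pipeline would compute these quantities over batches of tensors and pass the penalized reward to a PPO implementation. The scalar version only shows how the pieces fit together.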

Key facts

  • Popularized by: InstructGPT (OpenAI, 2022), which showed that human raters prefer the outputs of RLHF-trained models over those of models trained with supervised fine-tuning alone.
  • Alternatives: DPO (Direct Preference Optimization) trains the policy directly on preference pairs and can achieve results similar to RLHF without a separate reward model training step.
  • Human labelers: The quality and diversity of human raters are a significant bottleneck and cost driver in RLHF pipelines.
  • RLAIF: Reinforcement Learning from AI Feedback uses model-generated preference labels to scale the rating process.
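
To make the DPO alternative concrete, here is a minimal sketch of its loss for a single preference pair, written from the published DPO objective. The inputs are summed log-probabilities of each full response under the policy being trained and under a frozen reference model. The function name, argument names, and default beta are hypothetical.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative strength of the implicit KL constraint
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Computes -log(sigmoid(beta * (chosen_logratio - rejected_logratio))),
    where each log-ratio compares the policy to the reference model.
    No reward model is involved.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-logits)).
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2. It falls as the policy raises the chosen response's probability relative to the rejected one.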

For builders

Builders using RLHF-trained models get stronger instruction following and safer defaults, but should still apply system prompts and evals to shape outputs for their specific use case. Because RLHF optimizes for what human raters prefer, models sometimes give verbose or overly hedged answers that score well in rating studies but may not suit a particular product context.
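
As one possible starting point for such evals, the sketch below flags responses that are too long or that contain hedging phrases. Everything in it is an assumption made for illustration: the phrase list, the word limit, and the function name are hypothetical and would need tuning against real product transcripts.

```python
# Illustrative phrases only; a real list should come from reviewing actual outputs.
HEDGE_PHRASES = ("it depends", "as an ai", "it's important to note")


def evaluate_response(
    text: str,
    max_words: int = 80,  # hypothetical length budget for this product
    hedge_phrases: tuple[str, ...] = HEDGE_PHRASES,
) -> dict:
    """Return simple verbosity and hedging checks for one model response."""
    lowered = text.lower()
    word_count = len(text.split())
    hedges = [phrase for phrase in hedge_phrases if phrase in lowered]
    too_long = word_count > max_words
    return {
        "word_count": word_count,
        "too_long": too_long,
        "hedges": hedges,
        "passed": not too_long and not hedges,
    }


if __name__ == "__main__":
    print(evaluate_response("Restart the router."))
    print(evaluate_response("It depends on many factors."))
```

String matching like this is crude. It works best as a cheap first filter that runs before human review or a model-graded eval.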

Administrator · 41 published guides · Joined 2016
