Article Issue #5177

RLHF (Reinforcement Learning from Human Feedback)

What to know

RLHF (Reinforcement Learning from Human Feedback) is a multi-stage training procedure that refines a pretrained language model using signals derived from human evaluators comparing pairs of model outputs. It proceeds in three phases: supervised fine-tuning, reward model training, and policy optimization. Builders using RLHF-trained models benefit from better instruction following and safer defaults, but should still apply system prompts and evals to shape outputs for their specific use case.

Wikiwalls Team Administrator

May 15, 2026 2 min read


RLHF (Reinforcement Learning from Human Feedback) is a multi-stage training procedure that refines a pretrained language model using signals derived from human evaluators comparing pairs of model outputs. The goal is to shift the model’s behavior toward responses that humans rate as more helpful, accurate, and safe, beyond what next-token prediction on text corpora achieves.

How it works

RLHF proceeds in three phases. First, a supervised fine-tuning (SFT) phase trains the model on high-quality demonstrations. Second, human raters compare pairs of model outputs, and their preferences are used to train a reward model. Third, the language model is optimized with a reinforcement learning algorithm, typically proximal policy optimization (PPO), to maximize the reward model’s score, while a KL divergence penalty keeps the model from drifting too far from the supervised fine-tuned policy.
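The second and third phases can be made concrete with two small formulas. The sketch below is illustrative only: it assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample KL estimate for the penalty, which are common choices but not the only ones. All function names, the `beta` default, and the scalar inputs are hypothetical; a real pipeline would compute these over batches of token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Phase 2 (assumed form): loss for one human comparison.

    The loss shrinks as the reward model scores the human-preferred
    output higher than the rejected one.
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Phase 3 (assumed form): reward the RL step actually maximizes.

    logp_policy - logp_reference is a single-sample estimate of the KL
    divergence between the current policy and the fine-tuned reference.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # Reward model agrees with the rater: small loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Reward model disagrees with the rater: large loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # The policy assigns more probability than the reference: penalized.
    print(kl_penalized_reward(1.0, logp_policy=-0.5, logp_reference=-1.5))
```

A larger `beta` keeps the optimized model closer to the reference at the cost of a lower achievable reward score; a smaller one allows more drift.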

Key facts

Popularized by: InstructGPT (OpenAI, 2022), which demonstrated that RLHF produces models whose outputs human raters prefer over those from supervised fine-tuning alone.
Alternatives: DPO (Direct Preference Optimization) trains directly on preference pairs and achieves results similar to RLHF without a separate reward model training step.
Human labelers: The quality and diversity of human raters are a significant bottleneck and cost driver in RLHF pipelines.
RLAIF: Reinforcement Learning from AI Feedback uses model-generated preference labels to scale the rating process.
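To illustrate the DPO entry above, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model. The function name, argument names, and the `beta` default are hypothetical, and real implementations work on batched tensors rather than floats.

```python
import math


def dpo_loss(
    logp_policy_chosen: float,
    logp_policy_rejected: float,
    logp_ref_chosen: float,
    logp_ref_rejected: float,
    beta: float = 0.1,  # hypothetical temperature on the log-ratio margin
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    No reward model is trained: the log-ratio between the policy and the
    reference model plays the role of an implicit reward.
    """
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only depends on how the policy moves relative to the reference, which is why the KL-style regularization of RLHF is built into the objective instead of being a separate penalty term.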

For builders

Builders using RLHF-trained models benefit from better instruction following and safer defaults, but should still apply system prompts and evals to shape outputs for their specific use case. Because RLHF optimizes for preferences expressed by human raters, models sometimes give verbose or overly hedged answers that score well in rating studies but may not be optimal for a particular product context.
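As one way to act on this, a product team might run a small eval that flags responses that are too long or too hedged for their context. The sketch below is a toy example under stated assumptions: the phrase list, the 80-word limit, and the function names are all hypothetical placeholders, and a real eval would use criteria drawn from your own product requirements.

```python
# Hypothetical phrases that signal hedging in this product context.
HEDGE_PHRASES = (
    "it depends",
    "as an ai",
    "i cannot be certain",
    "it is important to note",
)


def check_response(text: str, max_words: int = 80) -> dict:
    """Report the length and the hedging phrases found in one model response."""
    lowered = text.lower()
    word_count = len(text.split())
    hedges = [phrase for phrase in HEDGE_PHRASES if phrase in lowered]
    return {
        "word_count": word_count,
        "too_long": word_count > max_words,
        "hedges": hedges,
    }


def pass_rate(responses: list[str], max_words: int = 80) -> float:
    """Fraction of responses that are neither too long nor hedged."""
    if not responses:
        return 0.0
    passed = 0
    for text in responses:
        result = check_response(text, max_words)
        if not result["too_long"] and not result["hedges"]:
            passed += 1
    return passed / len(responses)


if __name__ == "__main__":
    samples = ["Use port 443.", "It depends on many factors."]
    for sample in samples:
        print(check_response(sample))
    print(pass_rate(samples))
```

Tracking a metric like this across system prompt revisions gives a rough signal of whether prompt changes are reducing verbosity or hedging for your use case.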

