RLHF (Reinforcement Learning from Human Feedback)
RLHF (Reinforcement Learning from Human Feedback) is a multi-stage training procedure that refines a pretrained language model using signals derived from human evaluators comparing pairs of model outputs. The goal is to shift the model’s behavior toward responses that humans rate as more helpful, accurate, and safe, beyond what next-token prediction on text corpora achieves.
How it works
RLHF proceeds in three phases. First, supervised fine-tuning (SFT) trains the pretrained model on high-quality demonstrations. Second, human raters compare pairs of model outputs for the same prompt, and their preferences are used to train a reward model that predicts which response a human would prefer. Third, the language model is optimized with a reinforcement learning algorithm, typically proximal policy optimization (PPO), to maximize the reward model’s score, while a KL divergence penalty keeps the model from drifting too far from the supervised fine-tuned policy.
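The second and third phases can be illustrated with a minimal sketch of the two quantities involved: a pairwise loss for the reward model and a KL-penalized reward for the policy. This is an assumption-laden illustration, not a training pipeline: the function names, the scalar inputs, and the default beta of 0.1 are hypothetical, the pairwise loss follows the common Bradley-Terry formulation, and the PPO update itself is omitted.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen and r_rejected are the reward model's scalar scores for the
    response the rater preferred and the one the rater rejected.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_ref: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward-model score minus beta times a per-sample KL estimate.

    logp_policy and logp_ref are the log-probabilities of the same sampled
    response under the current policy and the frozen SFT reference model.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    # The loss shrinks as the reward model ranks the preferred response higher.
    print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
    # A response the policy favors more than the reference does is penalized.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.5))
```

The penalty term grows when the policy assigns a sampled response much higher probability than the reference model does, which is what discourages drift in the third phase.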
Key facts
- Popularized by: InstructGPT (OpenAI, 2022), which showed that human labelers preferred outputs from RLHF-trained models over those from models trained with supervised fine-tuning alone.
- Alternatives: DPO (Direct Preference Optimization) trains directly on preference pairs and can reach results comparable to RLHF without a separate reward model training step.
- Human labelers: The quality and diversity of human raters are a significant bottleneck and cost driver in RLHF pipelines.
- RLAIF: Reinforcement Learning from AI Feedback uses model-generated preference labels to scale the rating process.
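To make the DPO alternative above concrete, here is a minimal sketch of its per-pair loss as it is commonly written: the policy is pushed to raise the log-probability of the preferred response relative to a frozen reference model, more than it raises the rejected one. The function name, argument names, and the default beta are assumptions for illustration; real implementations operate on batches of summed token log-probabilities.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical temperature on the implicit reward
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen_margin - rejected_margin)).

    Each margin is the log-probability ratio between the policy and the
    frozen reference model for one response.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Stable form of -log(sigmoid(logits)), i.e. softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted toward the preferred response: loss is lower.
    print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

No reward model appears anywhere in the computation; the preference signal enters only through which response is labeled as chosen.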
For builders
Builders using RLHF-trained models benefit from stronger instruction following and safer defaults, but should still apply system prompts and evals to shape outputs for their specific use case. Because RLHF optimizes for what human raters prefer, models sometimes give verbose or overly hedged answers that score well in rating studies but may not be optimal for a particular product context.
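One lightweight way to check for the verbosity and hedging described above is a small rule-based eval run over sampled model responses. The sketch below is a hypothetical starting point: the phrase list, the 80-word budget, and the function names are all assumptions that would need tuning for a real product, and it takes already-collected response strings rather than calling any model API.

```python
# Hypothetical hedge phrases; a real eval would derive these from product data.
HEDGE_PHRASES = (
    "it depends",
    "as an ai",
    "it is important to note",
    "i cannot be certain",
)


def score_response(text, max_words=80, hedge_phrases=HEDGE_PHRASES):
    """Return simple verbosity and hedging signals for one response."""
    lowered = text.lower()
    word_count = len(text.split())
    hedges = [phrase for phrase in hedge_phrases if phrase in lowered]
    return {
        "word_count": word_count,
        "too_long": word_count > max_words,
        "hedges": hedges,
    }


def flagged_rate(responses, max_words=80, hedge_phrases=HEDGE_PHRASES):
    """Fraction of responses that are over budget or contain a hedge phrase."""
    if not responses:
        return 0.0
    flagged = 0
    for text in responses:
        result = score_response(text, max_words, hedge_phrases)
        if result["too_long"] or result["hedges"]:
            flagged += 1
    return flagged / len(responses)


if __name__ == "__main__":
    samples = ["Paris.", "It depends on many factors."]
    print(flagged_rate(samples))
```

Tracking a rate like this across system prompt revisions gives a rough signal of whether a prompt change actually reduced the behavior, though string matching will miss paraphrased hedges.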
Sources
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. arxiv.org
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. arxiv.org
- Christiano, P., et al. (2017). Deep Reinforcement Learning from Human Preferences. arXiv:1706.03741. arxiv.org
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. arxiv.org
- Hugging Face. Training Transformers documentation. huggingface.co