Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high-quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF and RL from AI Feedback (RLAIF), a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans, and find that the two yield similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ∼70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.
RLAIF (Reinforcement Learning from AI Feedback) trains language models using preferences labeled by an AI system instead of humans. This work compares RLAIF to RLHF (Reinforcement Learning from Human Feedback) on the task of text summarization. The key findings are:
1) RLAIF achieves comparable performance to RLHF. Both RLAIF and RLHF summaries are preferred over a supervised fine-tuned baseline around 70% of the time. When compared head-to-head, RLAIF and RLHF are equally preferred.
2) Various prompting techniques were tested to maximize alignment of AI preferences with human preferences. Providing a detailed task description and eliciting chain-of-thought reasoning improved alignment the most; surprisingly, few-shot examples did not help. A sketch of such a labeling prompt follows this list.
3) Using larger AI labeler models improved alignment with human preferences. Once preferences have been labeled by a large model, a reward model can be trained on just a few thousand examples (see the reward-model training sketch at the end of this summary).
4) Qualitatively, RLAIF summaries tend to hallucinate less often than RLHF summaries but can be less coherent. Overall, both approaches generate high-quality summaries.
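To make finding 2 concrete, here is a minimal Python sketch of what an AI preference-labeling prompt might look like under the setup described above: a detailed task description plus a chain-of-thought elicitation, with no few-shot examples. The preamble wording, the rubric, and the `build_labeling_prompt` / `label_preference` helpers are illustrative assumptions rather than the paper's exact prompts, and the labeler is passed in as a generic callable instead of any particular LLM API.

```python
from typing import Callable

# Assumed task description ("preamble"); the paper's actual wording may differ.
PREAMBLE = (
    "A good summary is a short piece of text that captures the essence of the "
    "original text, conveying the same information accurately and coherently "
    "without adding facts that are not in the original."
)

def build_labeling_prompt(text: str, summary_1: str, summary_2: str) -> str:
    """Assemble a zero-shot, chain-of-thought preference-labeling prompt."""
    return (
        f"{PREAMBLE}\n\n"
        f"Text: {text}\n\n"
        f"Summary 1: {summary_1}\n\n"
        f"Summary 2: {summary_2}\n\n"
        "Consider the coherence, accuracy, coverage, and overall quality of "
        "each summary. Think it through step by step, then answer with "
        "'Preferred Summary: 1' or 'Preferred Summary: 2'."
    )

def label_preference(
    labeler: Callable[[str], str], text: str, summary_1: str, summary_2: str
) -> int:
    """Return 0 if the AI labeler prefers Summary 1, 1 if it prefers Summary 2."""
    response = labeler(build_labeling_prompt(text, summary_1, summary_2))
    return 1 if "Preferred Summary: 2" in response else 0
```

The preference returned here is what replaces the human label: the resulting (text, preferred summary, dispreferred summary) triples are used to train a reward model exactly as in RLHF.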
The key conclusion is that RLAIF is a viable alternative to RLHF that does not require human annotation, offering a more scalable way to learn from preferences.
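As a rough illustration of the downstream step in finding 3, the PyTorch sketch below trains a reward model on AI-labeled preference pairs with the standard pairwise (Bradley-Terry) loss. The architecture, feature inputs, and hyperparameters are placeholder assumptions for the sake of a self-contained example; in practice the reward head sits on top of a pretrained language model rather than on random features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-size representation of (text, summary) to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Placeholder MLP head; a real reward model would reuse a pretrained
        # LM encoder and put this head on top of its final hidden state.
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the reward of the preferred summary above the other."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy loop on random features, just to show the shape of the training procedure.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(100):
    chosen = torch.randn(32, 768)    # stand-in features for AI-preferred summaries
    rejected = torch.randn(32, 768)  # stand-in features for dispreferred summaries
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The trained reward model then scores candidate summaries during the RL stage, just as a reward model trained on human labels would; the only change RLAIF makes is where the preference labels come from.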