Self-Rewarding Language Models // machines train machines
“To achieve superhuman agents, future models require superhuman feedback” — no need for humans
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve on both axes.
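To make the self-rewarding step concrete, here is a minimal sketch in Python, assuming a hypothetical `model.generate(text) -> str` interface; the judge prompt below only paraphrases the paper's additive 5-point LLM-as-a-Judge rubric rather than reproducing it.

```python
import re

# Hypothetical judge prompt, loosely modeled on the paper's additive 5-point
# LLM-as-a-Judge rubric; the exact wording here is an assumption.
JUDGE_TEMPLATE = """Review the user's question and the corresponding response.
Award up to 5 points additively for relevance, coverage, helpfulness, clarity,
and expert quality, then conclude with the line "Score: <total points>".

User: {prompt}
Response: {response}
"""


def self_reward(model, prompt: str, response: str) -> float:
    """Ask the same model to judge its own response and parse out the score."""
    judgement = model.generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else 0.0
```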
Here is a summary of the key points from the research paper:
- The paper proposes Self-Rewarding Language Models, where the language model provides its own rewards during training via LLM-as-a-Judge prompting. This allows the model to continually improve its ability to provide high-quality rewards.
- Current RLHF methods learn a separate frozen reward model from human preferences, which can bottleneck performance at the human level. Direct Preference Optimization (DPO) avoids the separate reward model but still trains on a fixed set of human preferences.
- Self-Rewarding Models possess both instruction following and self-instruction creation skills. For the latter, the model generates its own prompts and candidate responses and evaluates those candidates via LLM-as-a-Judge to assign rewards.
- Training is done iteratively: each iteration's model generates new preference pairs that serve as training data for the next iteration, so both instruction following and reward modeling skills improve across iterations (see the sketch after this list).
- Experiments using Llama 2 70B show performance gains over a standard fine-tuned baseline, and the model’s ability to provide high-quality rewards improves across iterations. After 3 iterations the model outperforms Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard.
- While preliminary, this demonstrates the promise of models that can provide increasingly better rewards to themselves, potentially exceeding human levels. Further research is needed to understand scaling behavior and safety.
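Putting the pieces together, one iteration of the loop might look like the sketch below; `generate_new_prompts` and `dpo_train` are hypothetical stand-ins for the paper's self-instruction creation and DPO training steps, and `self_reward` is the judging sketch above.

```python
def self_rewarding_iteration(model, seed_prompts, n_candidates=4):
    """One Self-Rewarding iteration: create instructions, judge, then DPO."""
    preference_pairs = []
    # 1. Self-instruction creation: the model writes new prompts for itself,
    #    few-shot prompted with examples drawn from the seed data.
    new_prompts = generate_new_prompts(model, seed_prompts)
    for prompt in new_prompts:
        # 2. Sample several candidate responses for each new prompt.
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # 3. LLM-as-a-Judge: the same model scores its own candidates.
        ranked = sorted(candidates, key=lambda r: self_reward(model, prompt, r))
        # 4. The highest- and lowest-scoring candidates form a preference pair.
        preference_pairs.append((prompt, ranked[-1], ranked[0]))
    # 5. DPO on the self-generated pairs yields the next iteration's model.
    return dpo_train(model, preference_pairs)
```

Here `dpo_train` denotes ordinary DPO on (prompt, chosen, rejected) triples against a frozen reference copy of the current model; feeding each returned model back into the function gives the iterative training described above.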
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with a large language model (LLM) to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.
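To picture what "reason and act upon external knowledge" means here, a bare-bones ReAct-style loop could look like the sketch below; the text protocol, the `search(query) -> str` tool, and the stopping rule are illustrative assumptions, not the paper's actual agent (which also includes a self-critique step, as noted in the summary that follows).

```python
def react_answer(model, question: str, search, max_steps: int = 5) -> str:
    """Interleave model-generated thoughts with search actions and observations."""
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues the trajectory with a thought plus either a
        # search action or a final answer.
        step = model.generate(trajectory + "Thought:")
        trajectory += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Search:" in step:
            query = step.split("Search:")[-1].strip()
            # The call to external knowledge is non-differentiable, which is
            # why the agent is instead trained on whole trajectories.
            trajectory += f"Observation: {search(query)}\n"
    return ""  # no answer produced within the step budget
```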
Here is a summary of the key points from the paper:
- The paper introduces a ReAct-style LLM agent with self-critique for long-form question answering. The agent performs multi-step reasoning by interleaving chain-of-thought with search API calls and observations.
- The agent is refined through a ReST-like method that iteratively trains on agent trajectories with AI feedback, enabling continuous self-improvement and self-distillation (sketched after this list).
- Starting from a prompted large model, after just two iterations the algorithm produces a small fine-tuned model with comparable performance on compositional QA benchmarks despite having two orders of magnitude fewer parameters.
- The method allows improving agent robustness without human-labeled training data, through self-critique, AI feedback, and synthetic data generation.
- Evaluations are done on the Bamboogle benchmark of compositional questions unanswerable by direct search, and a new complementary dataset BamTwoogle.
- Ablations analyze the impact of human filtering, number of trajectories per question, and self-critique steps.
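A condensed sketch of that grow-and-improve loop is given below; `collect_trajectory` (a rollout wrapper that records the full reasoning-and-search trace), `ai_feedback_accepts`, and `finetune` are hypothetical stand-ins for the paper's trajectory collection, AI ranking, and training steps, not its actual pipeline.

```python
def rest_style_improvement(model, questions, search, n_iterations=2, n_rollouts=4):
    """ReST-like loop: grow a batch of trajectories, then improve on the good ones."""
    for _ in range(n_iterations):
        # Grow: roll out the current agent several times per question,
        # recording full reasoning-and-search trajectories.
        trajectories = [
            collect_trajectory(model, question, search)
            for question in questions
            for _ in range(n_rollouts)
        ]
        # Improve: keep only trajectories that AI feedback judges as good,
        # then fine-tune on them (the same data can also be used to distill
        # into a smaller model).
        good = [t for t in trajectories if ai_feedback_accepts(model, t)]
        model = finetune(model, good)
    return model
```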
In summary, the paper demonstrates an effective application of ReST ideas to iteratively improve a reasoning LLM agent with AI feedback, while also distilling it into much smaller models. The key innovation is self-improvement without human labels by leveraging process supervision and synthetic data.