Reinforcement fine-tuning // RFT
Training Language Models to Self-Correct via Reinforcement Learning
In an exciting unveiling, OpenAI has taken another step forward in the evolution of artificial intelligence with the introduction of reinforcement fine-tuning (RFT) for its o1 model series. RFT goes beyond conventional fine-tuning techniques: while traditional fine-tuning teaches a model to imitate patterns in its training data, RFT uses reinforcement learning to teach the model to reason its way through problems in entirely new ways.
One notable application of RFT is in the legal sector, where OpenAI has partnered with Thomson Reuters to fine-tune a legal assistant for their CoCounsel AI using RFT. This tool assists legal professionals in navigating complex analytical workflows, showcasing how RFT can transform industry-specific tasks.
In scientific research, Justin Reese, a computational biologist at Berkeley Lab, has demonstrated the potential of RFT for understanding rare genetic diseases. By training the o1 models on curated datasets of symptoms and genetic data, researchers like Justin aim to accelerate the diagnosis and treatment of rare diseases affecting millions globally.
OpenAI’s engineers have made it possible to improve model performance with as few as a dozen examples, a feat that sets RFT apart from traditional fine-tuning methods. The process relies on graders, which score the model’s outputs against known correct answers, reinforcing the lines of reasoning that lead to correct answers so the model develops expert-level reasoning in custom domains.
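To make the grader idea concrete, here is a minimal sketch of what such a grader could look like. The function name and the exact-match scoring rule are illustrative assumptions for this post, not OpenAI's actual grader API.

```python
# Illustrative sketch of a grader: it compares a model's answer to a known
# reference answer and returns a score in [0, 1]. OpenAI's hosted graders are
# configured differently; this only shows the general idea.

def grade(model_output: str, reference_answer: str) -> float:
    """Return 1.0 for a case- and whitespace-insensitive exact match, else 0.0."""
    normalize = lambda s: " ".join(s.strip().lower().split())
    return 1.0 if normalize(model_output) == normalize(reference_answer) else 0.0

# Example: grading a model's answer against the known correct gene name.
print(grade("FOXP2", "foxp2"))   # 1.0
print(grade("BRCA1", "foxp2"))   # 0.0
```

Scores like these become the reinforcement signal: answers that grade well reinforce the reasoning that produced them.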
Excitingly, OpenAI is expanding access to RFT through their Reinforcement Fine-Tuning Research Program, aimed at organizations working on complex problems that could benefit from AI. This program opens the door for developers, researchers, and machine learning engineers to harness the potential of O1 models for their specific tasks, pushing the boundaries of what’s possible with AI.
As OpenAI prepares for a public launch of reinforcement fine-tuning next year, the possibilities seem limitless. By enabling models to learn and reason at an expert level, OpenAI is not only advancing AI technology but also empowering industries to tackle challenges with greater precision and effectiveness. We can’t wait to see what the future holds as more teams incorporate RFT into their toolkit, driving innovation and progress in ways we have yet to imagine.
What is Reinforcement Fine-Tuning?
This new model customization technique enables developers to customize OpenAI's models using dozens to thousands of high-quality tasks and grade the model's responses against provided reference answers. This reinforces how the model reasons through similar problems and improves its accuracy on specific tasks in that domain.
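As a rough picture of what "dozens to thousands of high-quality tasks" with reference answers might look like in practice, here is a hypothetical JSONL-style dataset. The field names and file name are assumptions for illustration, not OpenAI's required schema.

```python
import json

# Hypothetical RFT training tasks: each task pairs a prompt with a reference
# answer that a grader can score the model's response against.
tasks = [
    {"prompt": "A patient presents with symptoms X and Y. Which gene is most likely implicated?",
     "reference_answer": "FOXP2"},
    {"prompt": "Classify the following contract clause as indemnification or limitation of liability: ...",
     "reference_answer": "indemnification"},
]

# Write the tasks out as JSONL, one task per line, the format many
# fine-tuning pipelines expect for uploaded datasets.
with open("rft_tasks.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```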
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data.

To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model’s own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems.

SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
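To give a feel for the kind of two-attempt reward shaping the abstract describes, here is a minimal sketch. The function names, the stand-in correctness oracle, and the bonus coefficient alpha are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a SCoRe-style shaped reward for a two-turn self-correction episode:
# the second attempt earns a base reward for being correct, plus a bonus
# proportional to the improvement over the first attempt, which discourages
# the policy from collapsing into "never change the first answer".

def correctness(answer: str, reference: str) -> float:
    """Stand-in correctness oracle: 1.0 if the answer matches the reference."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def shaped_reward(first_attempt: str, second_attempt: str,
                  reference: str, alpha: float = 2.0) -> float:
    r1 = correctness(first_attempt, reference)
    r2 = correctness(second_attempt, reference)
    # Base reward for the final answer plus a bonus for genuine self-correction.
    return r2 + alpha * (r2 - r1)

# An episode where the model fixes its own mistake earns the bonus...
print(shaped_reward("x = 3", "x = 7", reference="x = 7"))  # 1.0 + 2.0 * 1.0 = 3.0
# ...while simply repeating a correct first answer does not.
print(shaped_reward("x = 7", "x = 7", reference="x = 7"))  # 1.0 + 2.0 * 0.0 = 1.0
```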
Overall, RFT is probably not meant for every task out there, but I think it could bring groundbreaking results in scientific domains and play a role in advancing models for scientific discovery.