Google DeepMind has developed a multi-turn online reinforcement learning approach that improves the self-correction capabilities of large language models (LLMs).
The researchers show that supervised fine-tuning (SFT) is ineffective for learning self-correction, in part because the training data does not match the distribution of the model’s own responses. To address this, they propose a two-stage approach: the first stage optimizes the model for self-correction behavior, and the second adds an extra reward during training to reinforce it. The method relies entirely on data generated by the model itself.
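To illustrate the idea of an additional reward for self-correction, here is a minimal sketch in Python. The text above does not give the exact formula, so everything here is an assumption made for illustration: the function name `shaped_reward`, the parameter `bonus_weight`, the 0/1 correctness rewards, and the choice to add a bonus proportional to the improvement from the first attempt to the second.

```python
def shaped_reward(first_correct: bool, second_correct: bool,
                  bonus_weight: float = 1.0) -> float:
    """Hypothetical reward for the second attempt in a two-attempt episode.

    Assumption: each attempt earns 1.0 if correct and 0.0 otherwise, and the
    bonus is proportional to the change in correctness between attempts.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    # Base reward for the final answer, plus a bonus that is positive when
    # the model fixes a mistake and negative when it breaks a correct answer.
    return r2 + bonus_weight * (r2 - r1)


if __name__ == "__main__":
    for first, second in [(False, True), (True, True), (True, False), (False, False)]:
        print(first, second, shaped_reward(first, second))
```

Under these assumptions, fixing a wrong first answer scores higher than simply repeating a correct one, and changing a correct answer to a wrong one is penalized, which is one way a reward could favor genuine correction over leaving the first answer untouched.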
Applied to the Gemini 1.0 Pro and 1.5 Flash models, the method achieved state-of-the-art self-correction performance, improving on the base models by 15.6% on the MATH benchmark and 9.1% on HumanEval.