Google DeepMind has developed SCoRe, a multi-turn online reinforcement learning approach that enhances the self-correction capabilities of large language models (LLMs).
The authors show that supervised fine-tuning (SFT) is ineffective for learning self-correction because of a mismatch between the training data and the model's own responses. To address this, they propose a two-stage approach: the first stage trains an initialization that optimizes self-correction behavior, and the second applies reinforcement learning with an additional reward bonus that reinforces self-correction during training. The method relies entirely on data generated by the model itself.
When applied to the Gemini 1.0 Pro and 1.5 Flash models, it achieved state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% on MATH and 9.1% on HumanEval.
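To give a feel for how the second stage works, here is a minimal sketch of the kind of reward shaping that reinforces self-correction: the revised answer is rewarded for correctness, plus a bonus proportional to its improvement over the first attempt. The function names and the bonus weight are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of reward shaping for self-correction.
# `is_correct` stands in for a real verifier; `shaped_reward` is illustrative.

def is_correct(answer: str, reference: str) -> float:
    """Toy verifier: 1.0 if the answer matches the reference, else 0.0."""
    return float(answer.strip() == reference.strip())

def shaped_reward(attempt1: str, attempt2: str, reference: str,
                  bonus_weight: float = 0.5) -> float:
    """Reward the revised answer, plus a bonus for improving on the first attempt.

    The bonus is positive when a wrong first attempt gets fixed and negative
    when a correct answer gets broken, which pushes the policy toward genuine
    self-correction rather than ignoring or degrading its first try.
    """
    r1 = is_correct(attempt1, reference)
    r2 = is_correct(attempt2, reference)
    return r2 + bonus_weight * (r2 - r1)

# A corrected mistake earns the largest reward; a broken answer is penalized.
print(shaped_reward("41", "42", "42"))   # 1.5
print(shaped_reward("42", "42", "42"))   # 1.0
print(shaped_reward("42", "41", "42"))   # -0.5
```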
Comprehensive Evaluation of Quantized LLMs Tuned for Instruction-Following
This study evaluates the performance of instruction-tuned LLMs under different quantization methods, covering models from 7B to 405B parameters.
Key findings from the study:
- Quantizing a larger LLM to a similar size as a smaller FP16 LLM generally yields better results on most tests.
- Performance varies significantly with the quantization method, model size, and bit width; weight-only quantization methods often deliver the best results for larger models (see the sketch after this list).
- Task difficulty does not significantly affect the accuracy degradation caused by quantization.
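To illustrate what weight-only quantization means in practice, the NumPy sketch below stores a weight matrix as INT8 with one scale per output channel while keeping activations in full precision. Real methods such as GPTQ or AWQ are far more sophisticated; everything here is an illustrative assumption, not the study's setup.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization of a weight matrix.

    Only the weights are stored as INT8, plus one FP scale per output
    channel (row of `w`); activations stay in floating point.
    """
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)          # avoid division by zero
    w_int8 = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return w_int8, scales

def linear_int8(x: np.ndarray, w_int8: np.ndarray, scales: np.ndarray):
    """y = x @ W.T with the weights dequantized on the fly."""
    return x @ (w_int8.astype(np.float32) * scales).T

# Quick check: the quantized layer stays close to the FP32 reference.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
x = rng.normal(size=(4, 64)).astype(np.float32)
w_q, s = quantize_weights_int8(w)
print(np.abs(linear_int8(x, w_q, s) - x @ w.T).max())    # small quantization error
```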
Do Large Language Models (LLMs) Have Memory?
According to current results, LLMs do indeed demonstrate memory. But what is the mechanism behind this memory?
The article explores LLMs’ memory capabilities, using the Universal Approximation Theorem (UAT) to explain their memory mechanism. It also proposes a new approach to evaluating LLM performance by comparing the memory capacities of different models.
The Transformer architecture operates as a dynamic approximation model under UAT, with a high capacity to adapt to input data. As a result, LLMs can recall entire passages from minimal input cues. Since this memory can only be verified when activated by input data, the authors call it "Schrödinger's Memory."
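The evaluation idea can be illustrated with a simple memory probe: give a model a minimal cue from text it has likely memorized and measure how much of the reference continuation it reproduces. The sketch below uses the Hugging Face transformers pipeline with GPT-2 purely for illustration; the cue, the scoring function, and the model choice are assumptions, not the paper's protocol.

```python
from transformers import pipeline

# Illustrative memory probe (not the paper's protocol): prompt a model with a
# minimal cue and check how much of the expected continuation it reproduces.
generator = pipeline("text-generation", model="gpt2")  # any causal LM could be compared here

def recall_score(cue: str, reference: str, max_new_tokens: int = 40) -> float:
    """Fraction of reference words reproduced at the right position in the output."""
    output = generator(cue, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"]
    continuation = output[len(cue):].split()
    reference_words = reference.split()
    hits = sum(1 for i, word in enumerate(reference_words)
               if i < len(continuation) and continuation[i] == word)
    return hits / max(len(reference_words), 1)

# Comparing this score across models (and cue lengths) gives a rough measure
# of how much "memory" each model can activate from minimal input.
cue = "To be, or not to be, that is the"
reference = "question: Whether 'tis nobler in the mind to suffer"
print(recall_score(cue, reference))
```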
Logic-of-Thought for Comprehensive Reasoning in LLMs
Logic-of-Thought (LoT) is a new prompting technique that uses propositional logic to extract and expand logical information from the input context, then adds it back to the prompt as augmentation.
LoT improves Chain-of-Thought (CoT) performance on the ReClor dataset by +4.35%, boosts CoT with Self-Consistency on LogiQA by +5%, and enhances Tree-of-Thoughts (ToT) results on the ProofWriter dataset by +8%.
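LoT extracts propositional expressions from the context, extends them with logical laws, and translates the result back into natural language that augments the prompt. The sketch below illustrates only the extension step on a toy symbolic representation, using contraposition and transitivity; in the paper the extraction and translation steps themselves rely on the LLM, and the propositions and phrasing here are illustrative assumptions.

```python
from itertools import product

# Toy logic-extension step in the spirit of LoT: take implications extracted
# from the context (hard-coded here), close them under contraposition and
# transitivity, and render the new facts as extra text for the prompt.

def negate(p: str) -> str:
    return p[4:] if p.startswith("not ") else f"not {p}"

def extend(implications: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Closure under contraposition (A -> B  =>  not B -> not A) and transitivity."""
    facts = set(implications)
    changed = True
    while changed:
        changed = False
        new = {(negate(b), negate(a)) for a, b in facts}
        new |= {(a, d) for (a, b), (c, d) in product(facts, facts) if b == c and a != d}
        if not new <= facts:
            facts |= new
            changed = True
    return facts

# A real pipeline would translate these facts into fluent text with an LLM;
# here they are rendered with a simple template.
extracted = {("it rains", "the ground is wet"), ("the ground is wet", "the match is cancelled")}
augmentation = ". ".join(f"If {a}, then {b}" for a, b in sorted(extend(extracted) - extracted))
print(augmentation)  # appended to the original prompt before querying the LLM
```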
Conclusion
- Self-correction in LLMs through reinforcement learning: Google DeepMind proposed an innovative approach to enhance the self-correction capabilities of large language models (LLMs), improving the Gemini models' self-correction by 15.6% on MATH and 9.1% on HumanEval. This highlights the significant potential of using model-generated data in LLM training.
- Quantization of LLMs for instruction-following: Quantization methods have a substantial impact on LLM performance, and quantized larger models (up to 405B parameters) often outperform smaller full-precision models of comparable size.
- Memory in LLMs: The study argues that LLMs possess memory and that the Transformer architecture functions as a dynamic approximation of the input data, exhibiting "Schrödinger's Memory," which is revealed only when activated by input.
- Logic-of-Thought for improving LLM reasoning: The new Logic-of-Thought (LoT) approach showed significant improvements in LLM performance on logical reasoning tasks, particularly on the ReClor, LogiQA, and ProofWriter datasets, confirming the effectiveness of using logical structures to enhance model performance.
Together, these studies emphasize the importance of adapting existing LLMs and implementing new approaches to improve their performance in tasks involving self-correction, quantization, memory, and logical reasoning.