Here is a summary of the key points from the first research paper:
- The authors introduce phi-1, a new 1.3 billion parameter Transformer-based language model for code generation.
- Phi-1 is trained on high-quality “textbook” data: code filtered from the web plus textbooks and exercises synthetically generated with GPT-3.5, totaling only about 7 billion tokens.
- Despite the small model and dataset size, phi-1 achieves state-of-the-art results on code generation benchmarks like HumanEval (50.6% pass@1) and MBPP (55.5% pass@1); a sketch of the standard pass@k estimator behind these scores follows this list.
- The authors argue that high quality, instructive training data is crucial for efficient learning in language models, allowing strong performance with less scale.
- They show that finetuning elicits emergent capabilities absent from the base model, such as using external libraries that were not in the finetuning exercises, suggesting finetuning consolidates knowledge from pretraining.
- Limitations include sensitivity to prompt variations, weaker natural language handling, and lower performance on spatial/counting tasks compared to much larger models.
- The work demonstrates the impact of highly curated training data and suggests that the methodology for creating such datasets is itself a central direction for advancing language models.
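
For context, pass@1 is the fraction of problems for which a single generated solution passes all unit tests. Below is a minimal sketch of the standard unbiased pass@k estimator used with HumanEval-style benchmarks; the sample counts in the example are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, c of which are correct, passes."""
    if n - c < k:
        return 1.0  # every draw of k samples contains at least one correct one
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers only: 200 generations for one problem, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))  # 0.185, the estimated pass@1 for this problem
```

For k = 1 the estimator reduces to the plain fraction of correct samples; the correction matters mainly for k > 1.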
In summary, the paper introduces an efficient method to train performant code generation models using carefully curated data, opening up possibilities for smaller-scale but still capable LLMs. The results also highlight the importance of training data quality and synthesis for reaching new frontiers in language model performance.
Here are the key points from the second paper, on outcome versus process supervision for math reasoning:
- Outcome supervision and process supervision are two methods for training reliable reward models for math problem solving. Outcome supervision provides feedback only on the final answer, while process supervision provides feedback on each reasoning step.
- Process supervision has benefits for interpretability and alignment, since it rewards following human-endorsed reasoning rather than just reaching the right outcome.
- In large-scale experiments, the process-supervised reward model (PRM) significantly outperformed the outcome-supervised reward model (ORM), solving 78% vs. 72% of test problems from the challenging MATH dataset when selecting the best solution out of 1860 samples per problem (a sketch of this best-of-N selection follows this list).
- In controlled small-scale experiments, process supervision also outperformed outcome supervision. Using active learning to select the most “convincing” wrong solutions improved data efficiency of process supervision by 2.6x.
- The PRM showed some generalization, performing better than the ORM on recent out-of-distribution STEM test problems.
- These advantages may encourage broader adoption of process supervision, since in this setting it did not incur an “alignment tax” of reduced performance.
- The full dataset of 800K human process supervision labels (PRM800K) was released to enable further research.
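
A minimal sketch of the best-of-N reranking used in that comparison, assuming a PRM that returns a per-step correctness probability and scoring each candidate by the product of its step scores; the `step_score` interface below is a hypothetical stand-in, not the paper's API.

```python
from typing import Callable, List

def rerank_with_prm(
    solutions: List[List[str]],                      # each candidate = list of reasoning steps
    step_score: Callable[[List[str], str], float],   # hypothetical PRM: P(step correct | context)
) -> int:
    """Best-of-N selection: score each candidate solution as the product of its
    per-step correctness probabilities and return the index of the best one."""
    def solution_score(steps: List[str]) -> float:
        score, context = 1.0, []
        for step in steps:
            score *= step_score(context, step)  # in practice, sum log-probs to avoid underflow
            context.append(step)
        return score
    return max(range(len(solutions)), key=lambda i: solution_score(solutions[i]))
```

Swapping in a scorer that looks only at the final answer gives an ORM-style baseline for the same selection procedure.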
Evaluating LLMs is hard: results are sensitive to prompt phrasing, benchmarks often have questionable construct validity (they may not measure the capability they claim to), and test sets can be contaminated by training data. These problems produce faulty methods both in research on LLMs themselves and in research that uses LLMs as tools.
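
One concrete way to quantify the prompt-sensitivity issue is to score the same model on the same items under several semantically equivalent prompt templates and report the spread in accuracy. A minimal sketch, assuming a `generate` callable for the model and exact-match scoring (both illustrative):

```python
import statistics
from typing import Callable, Dict, List, Tuple

def prompt_sensitivity(
    generate: Callable[[str], str],    # hypothetical model call: prompt -> answer
    templates: List[str],              # equivalent phrasings, each containing "{question}"
    dataset: List[Tuple[str, str]],    # (question, gold answer) pairs
) -> Dict[str, float]:
    """Measure how much accuracy swings across semantically equivalent prompts."""
    accuracies = []
    for template in templates:
        correct = sum(
            generate(template.format(question=q)).strip() == gold
            for q, gold in dataset
        )
        accuracies.append(correct / len(dataset))
    return {
        "mean": statistics.mean(accuracies),
        "spread": max(accuracies) - min(accuracies),  # large spread = prompt-sensitive
    }
```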