Reasoning models // how do they work, why should we care

Reasoning is a core feature of intelligence, but how can it be imitated at the scale of cutting-edge hardware and large models?

sbagency
9 min read · Sep 13, 2024

OpenAI announced new ‘reasoning’ models. How do they differ from ordinary LLMs? They get more time for “thinking”: more iterations to search for the right solution. A basic intuition from computation is that, given enough time, any computable problem can in principle be solved. In simple terms, more time (compute) means better results.

The AI competition isn’t only about billions of parameters or large GPU clusters. At absolute scale, raw size matters less than efficiency: the balance of energy (compute), space (memory), and time.

But still, all these models simply repeat pre-trained reasoning patterns and cannot synthesize new knowledge the way natural organisms can. It is still seq-to-seq generation: from completion models to instruct models, and now CoT models. These are incremental improvements on certain tests and benchmarks, and errors and hallucinations remain.

https://openai.com/index/introducing-openai-o1-preview/

OpenAI has introduced a new series of AI models called “o1,” focused on advanced reasoning for solving complex problems in fields like science, math, and coding. These models, starting with the first release on 9/12, are designed to spend more time thinking through problems, resulting in higher accuracy and problem-solving abilities. For example, the new model scored 83% on an International Mathematics Olympiad qualifying exam, compared to 13% by GPT-4o.

The o1 models have strong safety mechanisms, performing well on tests designed to prevent users from bypassing safety rules. OpenAI has enhanced its safety measures through collaborations with U.S. and U.K. AI Safety Institutes.

The o1 series includes the “o1-preview” for complex reasoning tasks and “o1-mini,” a faster, cost-effective model focused on coding. Both are available to ChatGPT Plus, Team, and Enterprise users, with developers able to access them through the API. Future updates will add features like browsing and file uploads.

https://x.com/sama/status/1834283100639297910

here is o1, a series of our most capable and aligned models yet:

https://openai.com/index/learning-to-reason-with-llms/

o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.

but also, it is the beginning of a new paradigm: AI that can do general-purpose complex reasoning.

o1-preview and o1-mini are available today (ramping over some number of hours) in ChatGPT for plus and team users and our API for tier 5 users.

screenshot of eval results in the tweet above and more in the blog post, but worth especially noting:

a fine-tuned version of o1 scored at the 49th percentile in the IOI under competition conditions! and got gold with 10k submissions per problem.

extremely proud of the team; this was a monumental effort across the entire company.

hope you enjoy it!

OpenAI o1 is a new large language model (LLM) trained to improve its reasoning abilities through reinforcement learning (RL). Key highlights include:

- Performance: OpenAI o1 ranks in the 89th percentile on Codeforces programming challenges, excels in the USA Math Olympiad qualifiers (AIME), and surpasses human PhD-level accuracy in science benchmarks (GPQA). It also significantly outperforms GPT-4o across reasoning-heavy tasks and on 54/57 MMLU subcategories.

- Reinforcement Learning: The model is trained to refine its “chain of thought,” allowing it to solve complex problems by breaking them down and learning from mistakes. This approach improves both train-time and test-time performance (a minimal sketch of this loop follows this list).

- Benchmarks: On AIME, o1 solved 74% of math problems compared to GPT-4o’s 12%, and it performed better than human PhDs on the GPQA benchmark. Its coding skills also improved, achieving a high Elo rating in Codeforces and scoring competitively in the 2024 International Olympiad in Informatics (IOI).

- Safety: Chain of thought reasoning enhances alignment and safety by making the model’s thought process more transparent. It demonstrated improved performance on safety tests, including harmful prompt completion and jailbreak scenarios.

- Human Preference: While o1-preview excels in reasoning tasks (coding, math, data analysis), it underperforms in natural language tasks compared to GPT-4o.

- Future Development: OpenAI plans to refine o1 further and believes its reasoning capabilities will open new use cases in science, coding, and related fields.
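
To make the reinforcement-learning point above concrete, here is a minimal sketch of the loop it implies: sample a chain of thought, score it, and nudge the policy toward high-reward chains. Everything here (the function names, the toy reward, the print-only update step) is a placeholder for illustration, not OpenAI’s implementation.

```python
import random

def sample_chain_of_thought(prompt: str) -> list[str]:
    # Placeholder policy: in practice this is the LLM decoding reasoning tokens.
    steps = ["restate the problem", "try a decomposition", "check the result"]
    return random.sample(steps, k=len(steps))

def reward(prompt: str, chain: list[str], answer_is_correct: bool) -> float:
    # Placeholder reward: real systems score the chain and/or the final answer.
    return 1.0 if answer_is_correct else -0.1

def reinforce_step(chain: list[str], r: float) -> None:
    # Placeholder update: a real trainer would raise the likelihood of
    # high-reward chains (a policy-gradient-style objective).
    print(f"reward={r:+.2f}  chain={' -> '.join(chain)}")

prompt = "How many r's are in 'strawberry'?"
for _ in range(4):  # several rollouts per prompt
    chain = sample_chain_of_thought(prompt)
    r = reward(prompt, chain, answer_is_correct=random.random() > 0.5)
    reinforce_step(chain, r)
```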

https://x.com/tedx_ai/status/1834346066482987340

OpenAI’s o1 (aka strawberry) model is finally available and is proving to be INCREDIBLY powerful! 🚀

So how did they do it?

> Like how AlphaGo used compute to scale Monte Carlo Tree Search (MCTS), OpenAI is now allocating compute to “think” before it speaks — resulting in highly-optimized chains of thought

> o1 considers many possible paths to answer your question and then back-propagates its way to the best possible reasoning path by using a critic model to score all the best chains of thought in a tree structure

> By scaling inference-time compute, they are then able to bake in successful reasoning traces back into the o1 model to improve performance over time on synthetically generated chains of thoughts
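
A rough stand-in for what the tweet describes: generate several candidate chains of thought, score each with a critic, and keep the best. This is a best-of-N simplification of the tree search described above, and the functions below are illustrative stubs rather than real model calls.

```python
import random

CANDIDATE_CHAINS = [
    ["split the problem", "solve each part", "combine"],
    ["guess an answer", "verify it", "revise if wrong"],
    ["look for a pattern", "generalize", "prove the claim"],
]

def generate_chains(question: str, n: int) -> list[list[str]]:
    # Stand-in for sampling n chains of thought from the model.
    return [random.choice(CANDIDATE_CHAINS) for _ in range(n)]

def critic_score(question: str, chain: list[str]) -> float:
    # Stand-in for a learned critic / verifier scoring a chain of thought.
    return random.random()

def best_of_n(question: str, n: int = 8) -> list[str]:
    chains = generate_chains(question, n)
    return max(chains, key=lambda c: critic_score(question, c))

print(best_of_n("Prove that the sum of two even numbers is even."))
```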

https://x.com/DrJimFan/status/1834279865933332752

OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there’re only 2 techniques that scale indefinitely with compute: learning & search. It’s time to shift focus to the latter.

1. You don’t need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small “reasoning core” that knows how to call tools like browser and code verifier. Pre-training compute may be decreased.

2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem like AlphaGo’s monte carlo tree search (MCTS).

3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on Arxiv a week apart last month:

- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. finds that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. finds that PaLM 2-S beats a 14x larger model on MATH with test-time search.

4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how to decide when to stop searching? What’s the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn’t share much.

5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards.

This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network — used to evaluate quality of each board position — improves as MCTS generates more and more refined training data.
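
A hedged sketch of that flywheel, assuming a hypothetical run_search routine and a verifier: keep only the traces whose final answer checks out and recycle them as training examples.

```python
import json
import random

def run_search(problem: str) -> tuple[list[str], str]:
    # Placeholder: returns (reasoning trace, final answer) from test-time search.
    answer = random.choice(["42", "41"])
    return ([f"steps leading to {answer}"], answer)

def is_correct(problem: str, answer: str) -> bool:
    # Placeholder verifier: unit tests, a math checker, or a human label.
    return answer == "42"

dataset = []
for problem in ["toy problem"] * 10:
    trace, answer = run_search(problem)
    if is_correct(problem, answer):
        # Correct traces become positive training examples; failed traces
        # can likewise be kept as negative-reward data.
        dataset.append({"prompt": problem, "trace": trace, "answer": answer})

print(json.dumps(dataset[:2], indent=2))
```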

https://arxiv.org/pdf/2407.21787v1

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage — the fraction of problems solved by any attempt — scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
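
The coverage numbers quoted above are closely related to the pass@k metric: the probability that at least one of k samples solves a problem. A small sketch of the standard unbiased estimator, computed from n samples of which c were correct (the example numbers are invented):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: probability that a random size-k subset of the
    # n samples contains at least one of the c correct ones.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 250 samples per problem, 40 of them correct.
for k in (1, 10, 100, 250):
    print(f"pass@{k}: {pass_at_k(250, 40, k):.3f}")
```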

https://aicoffeebreakwl.substack.com/p/how-openai-made-o1-think

At the end of the day, OpenAI o1 is still an LLM, and as such, it’s not without its faults. You’ll still see some funny failures and hallucinations here and there; these are inherent to LLMs, which o1 still is. Despite being trained with reinforcement learning to generate useful CoT tokens, o1 remains an LLM predicting the most likely next token; the correctness and usefulness of those tokens were rated by reward models during training. And the reward models that evaluate o1’s outputs are themselves LLM-based, meaning they, too, are susceptible to mistakes.

But with improvements in coding, math, and reasoning tasks, o1 could become an indispensable tool in scientific and technical fields. As always, make sure to apply a healthy dose of expertise when using any LLM and stay curious about how these technologies evolve.

https://arxiv.org/pdf/2305.20050

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
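
To make the distinction concrete, a toy contrast between the two signals: outcome supervision labels the whole solution once, while process supervision labels each intermediate step. The example problem and labels below are invented for illustration (PRM800K’s actual labels are per-step human judgements).

```python
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    correct: bool  # per-step human judgement

solution = [
    Step("Let x be the unknown quantity.", True),
    Step("Then 2x + 3 = 11, so x = 5.", False),  # arithmetic slip: x should be 4
    Step("Therefore the answer is 5.", False),
]

# Outcome supervision: a single label for the whole solution.
outcome_label = all(step.correct for step in solution)  # -> False

# Process supervision: a label for every intermediate step, so a reward model
# learns exactly where the reasoning went wrong.
for step in solution:
    print("OK " if step.correct else "BAD", step.text)
print("outcome label:", outcome_label)
```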

Responses (1)