LLM Reasoning for Math and Science // more productive research

Any research involves two phases: hypothesis generation and hypothesis testing.

sbagency
6 min read · Dec 14, 2024

We recently had the pleasure of hosting a discussion between Professor Terence Tao, Mark Chen, and James Donovan on the future of mathematics and artificial intelligence. The conversation shed light on the exciting possibilities that emerge when we combine human mathematical expertise with the capabilities of AI.

One of the key takeaways from the conversation was the need for a more modular approach to mathematics, allowing different individuals to contribute to a larger project without requiring a single person to be an expert in all aspects. This approach could potentially enable larger-scale collaborations and accelerate progress in the field.

The discussion also touched on the role of theorem provers and formalization in mathematics, as well as the potential benefits of using AI in other scientific applications, such as physics and biology.

Regarding the future of mathematics education, Professor Tao emphasized the importance of flexibility and adaptability in the face of technological advancements. Mark Chen noted that technical experts in various fields will continue to be essential in synergizing with AI tools.

When asked about the potential impact of AI on the way we approach mathematical problems, Professor Tao pointed out that the ability to recognize patterns and generate conjectures could become increasingly important. Mark Chen added that AI may excel in data-scarce environments, allowing it to make predictions or provide insights based on limited information.

https://x.com/LauraRuis/status/1859267739313185180

How do LLMs learn to reason from data? Are they ~retrieving the answers from parametric knowledge🦜? In our new preprint, we look at the pretraining data and find evidence against this:

Procedural knowledge in pretraining drives LLM reasoning ⚙️🔢

Since LLMs entered the stage, there has been a hypothesis prevalent:

When LLMs are reasoning, they are doing some form of approximate retrieval where they “retrieve” the answer to intermediate reasoning steps from parametric knowledge, as opposed to doing “genuine” reasoning.

This is not unreasonable to think given the trillions of tokens LLMs are trained on, their high capacity for memorisation, the well-documented issues with data contamination of evaluation benchmarks, and the prompt-dependent nature of LLM reasoning.

However, most studies don’t look at the pretraining data when they conclude models aren’t genuinely reasoning. In this project we wondered: even if the answers to reasoning steps are in the data, is the model relying on them when producing reasoning traces?

We use influence functions to estimate the effect pretraining data have on the likelihood of completions of two LLMs (7B and 35B), both for factual question answering and for reasoning traces on three simple mathematical tasks.
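
For intuition, the core quantity is a similarity between gradients: how aligned the gradient of the query completion's loss is with the gradient of a pretraining document's loss. Below is a minimal sketch in PyTorch, using a toy next-token model as a stand-in for the 7B/35B LLMs and a plain first-order gradient dot product rather than the full influence-function score; everything here is illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: a tiny next-token model over a small vocabulary.
# (Illustrative only -- the paper works with 7B/35B models and real pretraining data.)
torch.manual_seed(0)
vocab, dim = 50, 16
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(4 * dim, vocab))
loss_fn = nn.CrossEntropyLoss()

def grad_vector(tokens, target):
    """Flattened gradient of the completion loss w.r.t. all parameters."""
    model.zero_grad()
    logits = model(tokens.unsqueeze(0))
    loss = loss_fn(logits, target.unsqueeze(0))
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# One "query" (prompt + completion to explain) and one "pretraining document".
query_tokens, query_target = torch.randint(0, vocab, (4,)), torch.tensor(7)
doc_tokens, doc_target = torch.randint(0, vocab, (4,)), torch.tensor(7)

g_query = grad_vector(query_tokens, query_target)
g_doc = grad_vector(doc_tokens, doc_target)

# First-order influence score: roughly, how much upweighting this document in
# training would change the query loss. Positive = the document helps the completion.
influence = torch.dot(g_query, g_doc).item()
print(f"influence of document on query: {influence:.4f}")
```

In standard influence functions one of the gradients is additionally preconditioned by an approximate inverse Hessian before the dot product; the raw dot product above is only the simplest version of the idea.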

To my surprise, we find the opposite of what I thought when we started this project:

The approach to reasoning that LLMs use looks less like retrieval and more like a generalisable strategy that synthesises procedural knowledge from many documents doing a similar form of reasoning.

This is based on an analysis of the influence of 5M pretraining documents (covering 2.5B tokens) on factual questions, arithmetic, calculating slopes, and linear equations. All in all, we did **a billion** LLM-sized gradient dot products for this work 🧮
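
A billion dot products sounds daunting, but once per-document and per-query gradients are stored (or projected) and stacked, the scoring reduces to ordinary matrix products. A rough numpy sketch under assumed shapes (n_queries, grad_dim, and the streaming generator are placeholders), purely to illustrate the bookkeeping rather than the authors' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

n_queries, grad_dim = 100, 4096      # assumed sizes for stacked (projected) query gradients
Q = rng.standard_normal((n_queries, grad_dim)).astype(np.float32)

def stream_document_gradients(n_docs, dim):
    """Stand-in for loading one (projected) per-document gradient at a time."""
    for _ in range(n_docs):
        yield rng.standard_normal(dim).astype(np.float32)

n_docs = 10_000                      # tiny stand-in; the paper scores 5M documents
scores = np.empty((n_docs, n_queries), dtype=np.float32)

for i, g_doc in enumerate(stream_document_gradients(n_docs, grad_dim)):
    # One matrix-vector product gives this document's score against every query,
    # so n_docs * n_queries dot products reduce to n_docs small matmuls.
    scores[i] = Q @ g_doc

# Rank documents by influence on, say, query 0.
top = np.argsort(scores[:, 0])[::-1][:5]
print("most influential documents for query 0:", top)
```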

We ask several questions: does the model rely heavily on a few specific documents for its completions, or are many documents broadly useful, each contributing less individually? The former fits a retrieval strategy; the latter does not.

We find that the models rely less on individual documents when generating reasoning traces than when answering factual questions, and the set of documents they rely on is more general.
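
One way to make "relies less on individual documents" concrete is to ask how much of a query's total positive influence is carried by its top-ranked documents. The sketch below uses synthetic, heavy-tailed influence scores only to illustrate the kind of comparison; the distributions and the top-50 cutoff are assumptions, not the paper's exact metric:

```python
import numpy as np

rng = np.random.default_rng(1)

def top_k_share(scores, k):
    """Fraction of total positive influence carried by the k most influential documents."""
    pos = np.clip(scores, 0, None)
    top = np.sort(pos)[::-1][:k]
    return top.sum() / pos.sum()

n_docs = 100_000
# Synthetic stand-ins: a "factual" query whose influence is dominated by a few documents,
# and a "reasoning" query whose influence is spread thinly over many documents.
factual = rng.pareto(1.1, n_docs)     # heavy tail: a handful of documents dominate
reasoning = rng.pareto(3.0, n_docs)   # lighter tail: influence is more diffuse

for name, scores in [("factual", factual), ("reasoning", reasoning)]:
    print(f"{name:9s} top-50 docs carry {top_k_share(scores, 50):.1%} of total influence")
```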

We also ask: does a single document contribute similarly to different questions (which would indicate it contains generalisable knowledge), or is the set of influential documents very different for different questions (which would fit better with a retrieval strategy)?

We find that a document’s influence on the reasoning traces of a query is strongly predictive of that document’s influence on another query with the same mathematical task, indicating that influence picks up on procedural knowledge in documents for reasoning tasks.
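
The "strongly predictive" claim amounts to correlating per-document influence scores across pairs of queries: high within a task, low across tasks. A hedged sketch with synthetic scores, where the shared "procedural" component is assumed purely to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(2)
n_docs = 50_000

# Assumed generative story for illustration: queries from the same task share a
# "procedural" component of influence (documents showing how to do that kind of math)
# plus query-specific noise; queries from different tasks share nothing.
slope_shared = rng.standard_normal(n_docs)
linear_shared = rng.standard_normal(n_docs)

slope_q1 = slope_shared + 0.5 * rng.standard_normal(n_docs)
slope_q2 = slope_shared + 0.5 * rng.standard_normal(n_docs)
linear_q1 = linear_shared + 0.5 * rng.standard_normal(n_docs)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"same task (two slope queries):         r = {corr(slope_q1, slope_q2):.2f}")
print(f"different tasks (slope vs. linear eq.): r = {corr(slope_q1, linear_q1):.2f}")
```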

What kind of data is influential for the generated reasoning traces? Do we find the answer to the question, or the reasoning traces themselves, in the most influential data? If not, how else are the documents related to the query?

For the factual questions, the answer often shows up as highly influential, whereas for reasoning questions it does not.
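
Checking whether the answer "shows up" in influential data is essentially a string search over each query's top-ranked documents for the answer (and the answers to intermediate steps). A minimal sketch, with made-up documents, scores, and answers:

```python
def answer_in_top_docs(documents, influence_scores, answers, k=10):
    """Return which answer strings appear in the k most influential documents."""
    ranked = sorted(zip(influence_scores, documents), key=lambda x: x[0], reverse=True)
    top_docs = [doc for _, doc in ranked[:k]]
    return {ans: any(ans in doc for doc in top_docs) for ans in answers}

# Toy example: a factual query whose answer appears in an influential document,
# and a reasoning query where only the procedure (not the answer) does.
docs = [
    "The Eiffel Tower is 330 metres tall.",
    "To find the slope, divide the change in y by the change in x.",
    "Worked example: slope = (8 - 2) / (5 - 2) = 2.",
]
scores = [0.9, 0.7, 0.4]

print(answer_in_top_docs(docs, scores, answers=["330 metres"], k=2))  # factual answer found
print(answer_in_top_docs(docs, scores, answers=["slope is 3"], k=2))  # reasoning answer absent
```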

Also, we find evidence for code being both positively and negatively influential for reasoning!

Our findings show that models reason by applying procedural knowledge from similar cases seen during pretraining. This suggests we don’t need to cover every possible case in pretraining! Focusing on high-quality, diverse procedural data could be more effective.

We squeezed all our findings into Figure 1 of the preprint, but check out the paper or blogpost for details:

Paper: https://arxiv.org/abs/2411.12580
Blogpost: https://lauraruis.github.io/2024/11/10/if.html

This project has changed my view on LLM reasoning, and I’m very excited to see how far this style of procedural generalisation can go for larger models, or potentially different pretraining data splits.

We are working on releasing some of the top and bottom documents for each query; stay tuned! 🔥

For now, we have all queries and a few documents (~80) in the demo: https://lauraruis.github.io/Demo/Scripts/linked.html

Finally, I want to take this opportunity to thank @max_nlp; without his guidance, endless patience, and support we would never have been able to do this large-scale work. This past year, @cohere has taught me so much and I am very grateful.

If you made it this far, thanks for reading, and check out the paper or blogpost!

Paper: https://arxiv.org/abs/2411.12580
Blogpost: https://lauraruis.github.io/2024/11/10/if.html
Demo (more data to come!): https://lauraruis.github.io/Demo/Scripts/linked.html

https://arxiv.org/pdf/2411.12580

The capabilities and limitations of Large Language Models (LLMs) have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge. We further find that the answers to factual questions often show up in the most influential data. However, for reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning.

Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
