“Thinking” LLMs // endless attempts to make it work

There is no thinking inside an LLM, just pre-trained sequence approximation

sbagency

When the number of pre-trained (most frequently used) sequences is big enough, “the magic” begins.

https://arxiv.org/pdf/2410.10630

LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning — but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks.
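To make the recipe concrete, here is a minimal sketch of one TPO-style training iteration: sample several (thought, response) candidates per instruction, score only the responses with a judge, and build preference pairs for optimization. The helper names (`generate`, `judge_score`, `dpo_update`) and the thought-prompt wording are placeholders of ours, not the paper's actual implementation.

```python
def generate(model, prompt, n_samples=8, temperature=0.8):
    """Hypothetical sampler: returns n (thought, response) pairs for the prompt."""
    raise NotImplementedError

def judge_score(judge, instruction, response):
    """Hypothetical judge call: scores the response only -- the thought is never shown to it."""
    raise NotImplementedError

def dpo_update(model, preference_pairs):
    """Hypothetical preference-optimization step (e.g. DPO) on (chosen, rejected) pairs."""
    raise NotImplementedError

THOUGHT_PROMPT = (
    "Respond to the user instruction. First write your internal thoughts, "
    "then write the final response."
)

def tpo_iteration(model, judge, instructions, n_samples=8):
    preference_pairs = []
    for instruction in instructions:
        candidates = generate(model, THOUGHT_PROMPT + "\n\n" + instruction, n_samples)
        # Score candidates by their responses only; thoughts are optimized indirectly.
        scored = sorted(
            ((judge_score(judge, instruction, resp), thought, resp)
             for thought, resp in candidates),
            key=lambda x: x[0],
            reverse=True,
        )
        best, worst = scored[0], scored[-1]
        preference_pairs.append({
            "prompt": instruction,
            "chosen": best[1] + best[2],       # highest-scoring thought + response
            "rejected": worst[1] + worst[2],   # lowest-scoring thought + response
        })
    return dpo_update(model, preference_pairs)
```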

RELATED WORK
Reasoning in Language vs. Vectors: In this work we focus on thinking that is explicitly in natural language. Thinking in words takes advantage of the natural language understanding capability of LLMs. LLMs are trained on large pretraining corpora of human text, which contain human thoughts expressed in natural language, and this thinking ability is hence encoded into the model. While thinking in continuous values might provide more bandwidth, the Transformer architecture can already compute continuous vectors as hidden states and feed them to the next layer. However, these hidden vectors do not feed back into the model at the next token, and thus are not accessible to future lower layers (Fan et al., 2020). Word tokens, on the other hand, are fed back to the model immediately, i.e. during inference the previous output token is fed as input to predict the next token, making it possible to condition all future computations on them (Merrill & Sabharwal, 2024). Another advantage of word tokens is that there exist simple sampling mechanisms which allow thoughts to take different paths each time (Wang et al., 2023), which can be used to improve results, e.g. via majority vote.
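As an illustration of that sampling advantage, a self-consistency-style majority vote over independently sampled thought paths can be sketched as follows; `sample_answer` is a hypothetical stand-in for an LLM call that returns the final answer extracted from one sampled chain of thought.

```python
from collections import Counter

def sample_answer(model, question, temperature=0.7):
    """Hypothetical LLM call: returns the final answer parsed from one sampled thought path."""
    raise NotImplementedError

def majority_vote(model, question, n_paths=5):
    # Each sampled path can follow a different chain of thought;
    # aggregating the final answers via majority vote often improves accuracy.
    answers = [sample_answer(model, question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```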
Chain-of-Thought (CoT): CoT prompting (Wei et al., 2022) demonstrated that LLMs perform better at reasoning tasks when they are encouraged to write down intermediate reasoning steps. Since the type of thinking in CoT is dictated by the prompt instruction, there are now many variations of it facilitating different types of reasoning, such as decomposing into smaller problems (Zhou et al., 2023). It is now widely used for math and reasoning tasks, and most current LLMs are finetuned to do CoT by default for those types of tasks (Dubey et al., 2024). Other works like Pfau et al. (2024) show that a model equipped with CoT might be able to perform hidden thinking using filler tokens. However, CoT has seen more limited use in other types of tasks: a meta-analysis by Sprague et al. (2024) found that CoT techniques have little benefit outside of math and logic related tasks.
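For reference, the difference between direct prompting and CoT prompting is just the instruction to spell out intermediate steps; a tiny illustration (the exact wording is ours, not taken from Wei et al., 2022):

```python
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompting: the model is expected to answer immediately.
direct_prompt = f"Q: {question}\nA:"

# CoT prompting: a trigger phrase encourages intermediate reasoning steps
# before the final answer (here the common zero-shot variant).
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```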
Training to Think: There have been other previous efforts to train LLMs to think. Nye et al. (2021) trained a model to write intermediate calculations into a scratchpad section before writing the final answer, which improved performance on math and coding tasks. Similarly, Lehnert et al. (2024) showed that Transformers can solve complex planning tasks if they are trained to write A* search traces before outputting the solution. However, these methods rely on supervised training, so ground-truth thought data is required. STaR (Zelikman et al., 2022) removes this constraint by generating both the thought and the answer from a model using few-shot prompting. The generations are then filtered by the correctness of the answer and used for supervised finetuning. It also has an option to feed correct answers back to the model to generate better thought candidates. It was applied to multiple-choice reasoning and math tasks where the correct answers were available. Its generalization Quiet-STaR (Zelikman et al., 2024) aims to insert thought segments into unstructured text. This involves sampling a sequence of thought tokens after every input token, then training with a REINFORCE-based loss that optimizes the likelihood of subsequent input tokens. While it showed promising results on multiple-choice reasoning and math tasks, the training mechanism is complex and compute heavy. V-STaR (Hosseini et al., 2024) trains a DPO verifier on both correct and incorrect solutions and uses the verifier to select the response at inference time. IRPO (Pang et al., 2024) also trains a variant of DPO on math and reasoning problems to learn CoTs, assuming access to gold labels on the training set. Similarly, Self-Notes (Lanchantin et al., 2023) allows the model to deviate from the input at any time to write its thoughts, but relies on supervised training data for symbolic tasks. None of these methods have been applied to general instruction following with LLMs.
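A rough sketch of the STaR-style bootstrapping loop described above, with hypothetical placeholder functions for the LLM call and the finetuning step (our naming, not the original code):

```python
def generate_with_rationale(model, few_shot_prompt, question, hint=None):
    """Hypothetical LLM call returning (rationale, answer); hint optionally feeds the gold answer back."""
    raise NotImplementedError

def supervised_finetune(model, triples):
    """Hypothetical SFT step on the retained (question, rationale, answer) triples."""
    raise NotImplementedError

def star_round(model, dataset, few_shot_prompt):
    keep = []
    for question, gold_answer in dataset:
        rationale, answer = generate_with_rationale(model, few_shot_prompt, question)
        if answer != gold_answer:
            # "Rationalization": retry with the correct answer given as a hint,
            # which often yields a usable rationale for harder examples.
            rationale, answer = generate_with_rationale(
                model, few_shot_prompt, question, hint=gold_answer)
        if answer == gold_answer:
            # Keep only generations whose final answer is correct.
            keep.append((question, rationale, answer))
    return supervised_finetune(model, keep)
```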
System 2 methods: Many system 2 methods have emerged in recent years that add intermediate steps at inference time before producing a final answer. Those steps typically involve prompting the model with a certain goal, such as verifying the answer (Dhuliawala et al., 2024), rephrasing user questions (Deng et al., 2023), selecting sentences to attend to (Weston & Sukhbaatar, 2023), etc. Briakou et al. (2024) developed a method for translation incorporating intermediate steps of drafting and revising. Our TPO method is similar to these methods in its first step, because it uses prompting on the initial seed model, but it then optimizes the thoughts during training iterations. In contrast, the common feature of the system 2 methods just described is their reliance on handcrafted prompts designed for a specific goal (e.g. verification), without optimizing those steps via finetuning. Concurrent work by Kumar et al. (2024a) trains models to self-correct, while Yu et al. (2024) distill system 2 methods into system 1 with supervised finetuning. Rather than focusing on general thinking, these works teach the model specific skills.
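A generic example of the system 2 pattern these works share: a handcrafted intermediate prompt (here, verification) inserted at inference time, with nothing optimized by finetuning. This is our own illustration of the general shape, not any specific method's prompts.

```python
def llm(prompt):
    """Hypothetical single LLM call returning generated text."""
    raise NotImplementedError

def answer_with_verification(question):
    # Step 1: draft an answer directly.
    draft = llm(f"Answer the question.\nQ: {question}\nA:")
    # Step 2: a handcrafted verification prompt revises the draft at inference time.
    # Unlike TPO, this intermediate step is never optimized via training.
    return llm(
        "Check the draft answer for factual or logical errors and rewrite it if needed.\n"
        f"Question: {question}\nDraft: {draft}\nRevised answer:"
    )
```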

CONCLUSION
In this paper, we introduced Thinking LLMs, which think in natural language before writing a response for general instruction-following tasks. To train such models, we proposed a new training recipe called Thought Preference Optimization (TPO) for teaching Thinking LLMs to improve their thoughts. Unlike prior methods (Snell et al., 2024; Kumar et al., 2024b), which directly supervise the thought generation process through techniques like self-correction or self-refinement, we instead provide incentives for the model to generate its own thoughts, without explicitly teaching it how to think. In our experiments, we train and evaluate the models in the general instruction-following setup. The benchmark results show that the initial seed model and the first training iterations of the Thinking LLM perform poorly compared to a typical direct-response model. However, after multiple iterations of training with TPO, our method outperforms the baseline. Further, fine-grained evaluations reveal that thinking helps in categories that are not usually associated with reasoning or chain-of-thought methods. This is an encouraging result that will hopefully lead to wider adoption of Thinking LLMs in non-reasoning domains.
LIMITATIONS
We experimented with two different thought prompts, and observed some performance differences between them. It is likely that certain thought types are suited to certain tasks, and direct responses might even work better in some situations. Therefore, training on a diverse set of thought prompts and allowing the model to switch between them could lead to further improvements in performance. This would allow the model to better search the space of possible thoughts and learn to choose the most appropriate ones. However, we have not conducted these experiments. While we see improvement in overall performance with TPO, evaluation on GSM8K showed degraded math performance. As we discussed, this is likely due to our setup not being oriented toward such tasks. Incorporating more math instructions during training and having access to a judge capable of evaluating their answers are likely solutions. In the current version of the method, thought lengths are determined purely by the model itself; there is no steerability in terms of changing the number of thought tokens. Adding such functionality could be useful, as longer thoughts increase computation and the corresponding cost per user instruction. We could use techniques like Yuan et al. (2024a) for this purpose. All our experiments are based on 8B-parameter models; it is worth investigating the effect of thinking on larger-scale models. Given the compute requirements of such experiments, we leave that to future work.

--

sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.