Advanced LLM-based Reasoning // so close to truly intelligent systems

Curious how advanced oX-models work? // let’s see what “more compute” can do.

sbagency
8 min read · Jan 4, 2025

LLMs can provide basic language processing, mimicking how humans think. And the biggest advantage of computers is the ability to test an enormous number of variants per second: brute-forcing the problem. Well-known search optimizations play a crucial role in making that feasible and useful. But one piece is still missing.

The problem is that these models are just seq2seq approximators; reasoning is required to handle and solve new (unseen) tasks.

https://arxiv.org/pdf/2412.14135

OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-level performance on many challenging tasks that require strong reasoning ability. OpenAI has claimed that the main technique behind o1 is reinforcement learning (OpenAI, 2024a;b). Recent works use alternative approaches like knowledge distillation to imitate o1’s reasoning style, but their effectiveness is limited by the capability ceiling of the teacher model. Therefore, this paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning, focusing on four key components: policy initialization, reward design, search, and learning. Policy initialization enables models to develop human-like reasoning behaviors, equipping them with the ability to effectively explore solution spaces for complex problems. Reward design provides dense and effective signals via reward shaping or reward modeling, which guide both search and learning. Search plays a crucial role in generating high-quality solutions during both training and testing, and it can produce better solutions with more computation. Learning utilizes the data generated by search to improve the policy, which can achieve better performance with more parameters and more searched data. Existing open-source projects that attempt to reproduce o1 can be seen as parts or variants of our roadmap. Collectively, these components underscore how learning and search drive o1’s advancement, making meaningful contributions to the development of LLMs.
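To make the interplay of these four components concrete, here is a deliberately toy Python sketch of reward-guided best-of-N search feeding a learning loop. Everything here (`policy`, `reward`, `best_of_n`, `training_iteration`) is an illustrative placeholder, not the paper’s actual algorithm.

```python
import random

def policy(prompt: str) -> str:
    """Placeholder policy: in practice, an LLM sampled with temperature > 0.
    Faked here with random suffixes so the sketch runs."""
    return f"{prompt} -> candidate_{random.randint(0, 9)}"

def reward(solution: str) -> float:
    """Placeholder reward signal (e.g., unit tests, a learned process or
    outcome reward model). Faked here with a random score."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Search: spend more inference compute by sampling N candidates and
    keeping the one the reward signal scores highest."""
    return max((policy(prompt) for _ in range(n)), key=reward)

def training_iteration(prompts: list[str]) -> list[tuple[str, str]]:
    """Learning: solutions found by search become new training pairs used
    to improve the policy (e.g., rejection-sampling fine-tuning or RL)."""
    return [(p, best_of_n(p)) for p in prompts]

print(training_iteration(["prove that the sum of two odd numbers is even"]))
```

The key loop is search producing data and learning consuming it: more search compute yields better training data, which yields a stronger policy for the next round.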

https://platform.openai.com/docs/guides/reasoning

How reasoning works

The o1 models introduce reasoning tokens. The models use these reasoning tokens to “think”, breaking down their understanding of the prompt and considering multiple approaches to generating a response. After generating reasoning tokens, the model produces an answer as visible completion tokens, and discards the reasoning tokens from its context.

Here is an example of a multi-step conversation between a user and an assistant. Input and output tokens from each step are carried over, while reasoning tokens are discarded.
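A minimal sketch of that loop, assuming the official `openai` Python SDK and an o1-family model; the `completion_tokens_details.reasoning_tokens` usage field follows OpenAI’s reasoning guide, but treat the exact names as an assumption that may change.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = []       # visible input/output tokens carried across turns

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="o1-mini", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    # Reasoning tokens are billed but discarded server-side; they never
    # re-enter the context on later turns.
    details = response.usage.completion_tokens_details
    print(f"reasoning tokens (hidden, not carried over): {details.reasoning_tokens}")
    return answer

ask("How many primes are there below 100?")
ask("And below 200?")  # only the visible turns above are resent
```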

https://www.deeplearning.ai/short-courses/reasoning-with-o1/
https://arxiv.org/pdf/2411.16489

This paper presents a critical examination of current approaches to replicating OpenAI’s O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work (Part 1 (Qin et al., 2024)) explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1’s API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on simply tens of thousands of samples of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety, and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) a detailed technical exposition of the distillation process and its effectiveness; (2) a comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility; (3) a critical discussion of the limitations and potential risks of over-relying on distillation approaches. Our analysis culminates in a crucial “bitter lesson”: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount. This educational imperative represents not just a technical consideration, but a fundamental human mission that will shape the future of AI innovation. Relevant resources will be available at https://github.com/GAIR-NLP/O1-Journey.
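Stripped to its essentials, the recipe is: sample long-thought chains from the teacher’s API, then run standard supervised fine-tuning on them. The sketch below assumes the `openai` SDK for collection and gestures at `trl` for the SFT step; the model names, file paths, and prompt set are illustrative, not the paper’s actual setup.

```python
import json
from openai import OpenAI

client = OpenAI()

def distill(problems: list[str], teacher: str = "o1-preview") -> None:
    """Step 1: collect long-thought solutions from the teacher's API."""
    with open("distilled.jsonl", "w") as f:
        for p in problems:
            r = client.chat.completions.create(
                model=teacher, messages=[{"role": "user", "content": p}]
            )
            f.write(json.dumps({"prompt": p,
                                "response": r.choices[0].message.content}) + "\n")

# Step 2 (sketched in outline, not verified against a specific trl version):
# from datasets import load_dataset
# from trl import SFTTrainer
# ds = load_dataset("json", data_files="distilled.jsonl")["train"]
# SFTTrainer(model="some-open-base-model", train_dataset=ds).train()
```

The paper’s point is precisely that this pipeline is technically unremarkable: the capability comes from the teacher’s long-thought data, which is also why it inherits the teacher’s ceiling.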

https://arxiv.org/pdf/2412.18914

Long-range tasks require reasoning over long inputs. Existing solutions either need large compute budgets, training data, access to model weights, or use complex, task-specific approaches. We present PRISM, which alleviates these concerns by processing information as a stream of chunks, maintaining a structured in-context memory specified by a typed hierarchy schema. This approach demonstrates superior performance to baselines on diverse tasks while using at least 4x smaller contexts than long-context models. Moreover, PRISM is token-efficient. By producing short outputs and efficiently leveraging key-value (KV) caches, it achieves up to 54% cost reduction when compared to alternative short-context approaches. The method also scales down to tiny information chunks (e.g., 500 tokens) without increasing the number of tokens encoded or sacrificing quality. Furthermore, we show that it is possible to generate schemas to generalize our approach to new tasks with minimal effort.

This paper introduces PRISM, a method for handling long-range tasks with short-context Large Language Models (LLMs). PRISM processes information in chunks, maintaining a structured, in-context memory defined by a typed hierarchy schema. Key features include (a code sketch follows the list):

Incremental Processing: Information is processed sequentially in chunks, with a structured memory accumulating relevant information from previous chunks.

Structured Memory: Instead of natural language, the LLM interacts with a structured memory defined by a user-specified typed hierarchy schema (e.g., JSON). This helps the LLM focus on relevant information.

Programmatic Memory Revisions: The LLM outputs proposed revisions to the memory (add or update a specific path with a given value) rather than overwriting the whole memory, reducing output tokens.

Token Efficiency: PRISM leverages key-value (KV) caching by reusing activations of unchanged parts of the memory, leading to lower encoding costs. The authors propose using “amendments,” which improve cache efficiency further, though at a cost of larger memory.

Task Agnostic: PRISM can be applied to various long-range tasks by defining different memory schemas.

Automatic Schema Generation: The method is flexible as memory schemas can be generated by an LLM given a task description, reducing the need for human expertise.

Scalability: The method scales down to small chunk sizes without a significant increase in cost because of key-value caching.
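Here is a toy Python sketch of this processing loop, under loose assumptions: the schema, the revision format, and the `call_llm` stub are illustrative stand-ins, not the authors’ implementation.

```python
import json

# A typed hierarchy schema (illustrative), e.g. for book summarization.
SCHEMA = {"book": {"chapters": [{"title": "str", "summary": "str"}]}}

def call_llm(prompt: str) -> str:
    """Stub for a short-context LLM call that returns a JSON list of
    revisions, e.g. [{"op": "add", "path": "book/chapters", "value": {...}}]."""
    raise NotImplementedError

def apply_revision(memory: dict, rev: dict) -> None:
    """Apply one programmatic revision instead of regenerating the whole
    memory; short outputs keep KV caches of unchanged prefixes reusable."""
    node = memory
    *parents, leaf = rev["path"].split("/")
    for key in parents:
        node = node.setdefault(key, {})
    if rev["op"] == "add":
        node.setdefault(leaf, []).append(rev["value"])
    elif rev["op"] == "update":
        node[leaf] = rev["value"]

def prism(chunks: list[str]) -> dict:
    memory: dict = {}
    for chunk in chunks:  # stream the long input through a short context
        prompt = (f"Schema: {json.dumps(SCHEMA)}\n"
                  f"Memory: {json.dumps(memory)}\n"
                  f"Chunk: {chunk}\n"
                  "Propose revisions as a JSON list of {op, path, value}.")
        for rev in json.loads(call_llm(prompt)):
            apply_revision(memory, rev)
    return memory
```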

The authors evaluated PRISM on three datasets: BooookScore (book summarization), RepoQA (code retrieval), and LOFT-Spider (database question answering).

The method outperformed short-context baselines and achieved results comparable to much larger long-context LLMs on summarization while using only a fraction of the context length. PRISM also showed significant token and cost savings by leveraging KV caching and programmatic memory revisions, with “amendments” further improving cache hit rates. Exploring the trade-off between chunk size, cost, and accuracy, the authors found they could reduce chunk size without a significant cost increase. Finally, experiments using LLM-generated schemas performed comparably to those using hand-crafted schemas.

https://arxiv.org/pdf/2409.17270

Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces PROOF OF THOUGHT, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate PROOF OF THOUGHT’s effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.
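To illustrate the core idea rather than the paper’s actual DSL, here is a hedged sketch: a JSON “program” (hand-written below, but standing in for LLM output) is lowered by a small interpreter into logic constraints for the Z3 theorem prover (`pip install z3-solver`). The `facts`/`rules`/`query` format is an assumption made for this example, and propositional atoms stand in for the paper’s richer sorted first-order constructs.

```python
import json
from z3 import Bool, Implies, Not, Solver, sat

# Hypothetical LLM output in a JSON-based DSL (hand-written here).
program = json.loads("""
{
  "facts": ["socrates_is_human"],
  "rules": [{"if": "socrates_is_human", "then": "socrates_is_mortal"}],
  "query": "socrates_is_mortal"
}
""")

atoms = {}
def atom(name: str):
    return atoms.setdefault(name, Bool(name))

solver = Solver()
for fact in program["facts"]:
    solver.add(atom(fact))  # factual knowledge, asserted directly
for rule in program["rules"]:
    # Inferential knowledge stays explicit and separate from the facts.
    solver.add(Implies(atom(rule["if"]), atom(rule["then"])))

# The query is proved iff its negation is unsatisfiable given facts + rules.
solver.add(Not(atom(program["query"])))
print("query proved:", solver.check() != sat)
```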

“All our knowledge begins with the senses, proceeds then to the understanding, and ends with reason. There is nothing higher than reason.” — Immanuel Kant, in Critique of Pure Reason

---

PROOF OF THOUGHT bridges the gap between language models’ flexibility and formal logic’s rigor, offering a promising solution for trustworthy reasoning in vision-language models. By enhancing interpretability and providing reasoning guarantees, PoT addresses critical challenges in AI system accountability and reliability. Our results demonstrate its potential in both natural language and multimodal reasoning tasks, paving the way for more transparent, verifiable AI systems capable of complex reasoning in high-stakes domains.

---


Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
