Reasoning // updates

sbagency
8 min read · Feb 24, 2024

https://bnnbreaking.com/world/us/unraveling-ais-logic-how-order-of-information-alters-ai-reasoning-a-study-by-google-deepmind-and-stanford
https://arxiv.org/pdf/2402.14328.pdf

LLMs have marked a revolutionary shift, yet they falter when faced with compositional reasoning tasks. Our research embarks on a quest to uncover the root causes of compositional reasoning failures of LLMs, finding that most of them stem from improperly generated or leveraged implicit reasoning results. Inspired by our empirical findings, we resort to Logit Lens and an intervention experiment to dissect the inner hidden states of LLMs. This deep dive reveals that implicit reasoning results indeed surface within middle layers and play a causative role in shaping the final explicit reasoning results. Our exploration further locates multi-head self-attention (MHSA) modules within these layers, which emerge as the linchpins in accurate generation and leveraging of implicit reasoning results. Grounded on the above findings, we develop CREME, a lightweight method to patch errors in compositional reasoning via editing the located MHSA modules. Our empirical evidence stands testament to CREME’s effectiveness, paving the way for autonomously and continuously enhancing compositional reasoning capabilities in language models.
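
A minimal logit-lens sketch of the kind of inspection described above: project each layer's hidden state through the unembedding to see which token the model "has in mind" mid-forward. The model (gpt2) and the two-hop prompt are placeholders of mine, not the paper's setup.

```python
# Minimal logit-lens sketch: decode each layer's hidden state at the last
# position through the final layer norm and unembedding matrix.
# Model choice (gpt2) and the prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "The capital of the country where the Eiffel Tower is located is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: tuple of (num_layers + 1) tensors, shape [1, seq_len, d_model]
for layer, h in enumerate(out.hidden_states):
    last = model.transformer.ln_f(h[0, -1])   # final layer norm
    logits = model.lm_head(last)              # project into vocabulary space
    top = tok.decode(logits.argmax().item())
    print(f"layer {layer:2d} -> top token: {top!r}")
```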

https://arxiv.org/pdf/2402.13744.pdf

The development of artificial intelligence systems with advanced reasoning capabilities represents a persistent and long-standing research question. Traditionally, the primary strategy to address this challenge involved the adoption of symbolic approaches, where knowledge was explicitly represented by means of symbols and explicitly programmed rules. However, with the advent of machine learning, there has been a paradigm shift towards systems that can autonomously learn from data, requiring minimal human guidance. In light of this shift, in recent years, there has been increasing interest and effort in endowing neural networks with the ability to reason, bridging the gap between data-driven learning and logical reasoning. Within this context, Neural Algorithmic Reasoning (NAR) stands out as a promising research field, aiming to integrate the structured and rule-based reasoning of algorithms with the adaptive learning capabilities of neural networks, typically by tasking neural models to mimic classical algorithms. In this dissertation, we provide theoretical and practical contributions to this area of research. We begin with a review of foundational principles necessary for the understanding of the notions presented in this thesis, followed by a comprehensive overview of the most important NAR principles. Proceeding forward, we explore the connections between neural networks and tropical algebra, deriving powerful architectures that are aligned with algorithm execution. These architectures are thus proven to be able to approximate some min-aggregated dynamic programming algorithms up to arbitrary precision. Furthermore, we discuss and show the ability of such neural reasoners to learn and manipulate complex algorithmic and combinatorial optimization concepts, such as the principle of strong duality. Here, we rigorously evaluate this capacity through extensive quantitative and qualitative studies. Finally, in our empirical efforts, we validate the real-world utility of NAR networks across different practical scenarios. This includes tasks as diverse as planning problems, large-scale edge classification tasks and the learning of polynomial-time approximate algorithms for NP-hard combinatorial optimization problems. Through this comprehensive exploration, we aim to showcase the potential and versatility of integrating algorithmic reasoning in machine learning models.
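
The NAR alignment idea is easiest to see against its target: a min-aggregated dynamic-programming update, e.g. one Bellman-Ford relaxation step that a min-aggregating message-passing layer is trained to imitate. A toy sketch (the graph and its weights are made up):

```python
# The DP target a min-aggregating neural reasoner mimics: one Bellman-Ford
# relaxation step, d[v] <- min(d[v], min_u d[u] + w(u, v)).
import numpy as np

# small directed graph as (src, dst, weight) edges, toy values
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0), (2, 3, 8.0)]
n = 4
d = np.full(n, np.inf)
d[0] = 0.0  # source node

def relax_step(d, edges):
    """One synchronous min-aggregation step (the 'algorithmic' message passing)."""
    new_d = d.copy()
    for u, v, w in edges:
        new_d[v] = min(new_d[v], d[u] + w)
    return new_d

for _ in range(n - 1):
    d = relax_step(d, edges)
print(d)  # shortest-path distances from node 0: [0. 3. 1. 8.]
```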

https://arxiv.org/pdf/2402.09967.pdf

Large Language Models (LLMs) excel in generating personalized content and facilitating interactive dialogues, showcasing their remarkable aptitude for a myriad of applications. However, their ability to reason and to provide explainable outputs remains an area for improvement. In this study, we delve into the reasoning abilities of LLMs, highlighting the current challenges and limitations that hinder their effectiveness in complex reasoning scenarios.

Here is a summary of the key points from the paper:

The document discusses evaluating large language models (LLMs) on their ability to perform commonsense reasoning. Several benchmark datasets are introduced that test different aspects of commonsense reasoning:

- MCTACO focuses on temporal/sequential reasoning

- PIQA tests understanding of everyday physical phenomena

- CoS-E adds human explanations to CommonsenseQA to explain the reasoning

- StepGame evaluates spatial reasoning from text descriptions

- Cosmos QA requires commonsense reasoning for reading comprehension

- Winogrande uses Winograd schema questions to avoid dataset biases

- CommonsenseQA specifically utilizes the ConceptNet knowledge graph

The document notes there are still gaps between human and LLM performance on these benchmarks. It suggests approaches to improve commonsense reasoning abilities in LLMs, including integrating external knowledge sources like ConceptNet, using sophisticated models and training methodologies, and developing more advanced prompts and fine-tuning techniques. Overall, the document examines the progress in commonsense reasoning for LLMs but highlights the need for continued research to reach human-like capabilities.
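
Most of these benchmarks are scored as multiple choice. A rough sketch of the usual zero-shot evaluation loop for a CommonsenseQA-style item; the example question and the `ask_model` stub are illustrative, not taken from the datasets or the paper:

```python
# Zero-shot multiple-choice scoring sketch for a CommonsenseQA-style item.
# `ask_model` is a hypothetical stand-in for the LLM under evaluation.
def ask_model(prompt: str) -> str:
    # placeholder: return the model's raw text completion for `prompt`
    return "B"

question = "Where would you put a dirty dish after dinner?"
choices = {"A": "cupboard", "B": "sink", "C": "garden", "D": "bookshelf"}

prompt = (
    "Answer the following question with a single letter.\n\n"
    f"Question: {question}\n"
    + "\n".join(f"{k}. {v}" for k, v in choices.items())
    + "\nAnswer:"
)

raw = ask_model(prompt).strip()
prediction = next((c for c in choices if raw.upper().startswith(c)), None)
print("Predicted choice:", prediction, "->", choices.get(prediction))
```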

https://arxiv.org/pdf/2402.12240.pdf

BE Aware of Reasoning Shortcuts // BEARS

Neuro-Symbolic (NeSy) predictors that conform to symbolic knowledge — encoding, e.g., safety constraints — can be affected by Reasoning Shortcuts (RSs): They learn concepts consistent with the symbolic knowledge by exploiting unintended semantics. RSs compromise reliability and generalization and, as we show in this paper, they are linked to NeSy models being overconfident about the predicted concepts. Unfortunately, the only trustworthy mitigation strategy requires collecting costly dense supervision over the concepts. Rather than attempting to avoid RSs altogether, we propose to ensure NeSy models are aware of the semantic ambiguity of the concepts they learn, thus enabling their users to identify and distrust low-quality concepts. Starting from three simple desiderata, we derive bears (BE Aware of Reasoning Shortcuts), an ensembling technique that calibrates the model’s concept-level confidence without compromising prediction accuracy, thus encouraging NeSy architectures to be uncertain about concepts affected by RSs. We show empirically that bears improves RS-awareness of several state-of-the-art NeSy models, and also facilitates acquiring informative dense annotations for mitigation purposes.
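
The ensembling intuition is simple to sketch: average concept distributions from several independently trained concept extractors and read concept-level uncertainty off the averaged distribution. The numbers below are toy values, and the actual bears recipe involves more than this:

```python
# Ensemble-based concept uncertainty sketch: members that disagree on a
# concept (a symptom of a reasoning shortcut) push the averaged distribution
# toward high entropy, signalling "distrust this concept".
import numpy as np

# Each row: one ensemble member's distribution over 3 concept values for the same input.
member_probs = np.array([
    [0.05, 0.90, 0.05],   # member 1: confident in concept value 1
    [0.80, 0.15, 0.05],   # member 2: confident in concept value 0
    [0.10, 0.45, 0.45],   # member 3: undecided
])

avg = member_probs.mean(axis=0)                 # ensemble concept distribution
entropy = -(avg * np.log(avg + 1e-12)).sum()    # high entropy => low-quality concept

print("ensemble distribution:", np.round(avg, 3))
print("concept-level entropy:", round(float(entropy), 3))
```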

https://arxiv.org/pdf/2402.12348.pdf

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs’ reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBENCH, a language-driven environment comprising 10 widely recognized tasks across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we investigate two key problems: ➊ Characterizing game-theoretic reasoning of LLMs; ➋ LLM-vs-LLM competitions as reasoning evaluation. We observe that ➊ LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games yet they are competitive in probabilistic gaming scenarios; ➋ Open-source LLMs, e.g., CodeLlama-34b-Instruct, are less competitive than commercial LLMs, e.g., GPT-4, in complex games. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. Detailed error profiles are also provided for a better understanding of LLMs’ behavior.
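
A rough sketch of an LLM-vs-LLM match loop in the GTBENCH spirit: the environment renders the state as text, each agent proposes a move, and illegal moves lose by default. Tic-tac-toe and the `query_llm` stub are placeholders, not the benchmark's actual tasks or API:

```python
# LLM-vs-LLM game loop sketch. `query_llm` is a hypothetical stand-in for a
# model call; a real harness would prompt each agent with the rendered board.
import random

def query_llm(player: str, board: list[str]) -> int:
    # placeholder agent: picks a random legal square instead of calling a model
    legal = [i for i, c in enumerate(board) if c == "."]
    return random.choice(legal)

WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

board = ["."] * 9
for player in "XOXOXOXOX":
    move = query_llm(player, board)
    if board[move] != ".":          # illegal move: immediate loss
        print(f"{player} played an illegal move and loses")
        break
    board[move] = player
    if winner(board):
        print(f"{player} wins")
        break
else:
    print("draw")
```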

https://arxiv.org/pdf/2402.13602.pdf

Large Language Models (LLMs) have garnered significant attention for their ability to understand text and images, generate human-like text, and perform complex reasoning tasks. However, their ability to generalize this advanced reasoning over natural language text for decision-making in dynamic situations requires further exploration. In this study, we investigate how well LLMs can adapt and apply a combination of arithmetic and common-sense reasoning, particularly in autonomous driving scenarios. We hypothesize that LLMs’ hybrid reasoning abilities can improve autonomous driving by enabling them to analyze detected object and sensor data, understand driving regulations and physical laws, and offer additional context. This addresses complex scenarios, like decisions in low visibility (due to weather conditions), where traditional methods might fall short. We evaluated LLMs based on accuracy by comparing their answers with human-generated ground truth inside CARLA. The results showed that when a combination of images (detected objects) and sensor data is fed into the LLM, it can offer precise information for brake and throttle control in autonomous vehicles across various weather conditions. This formulation and its answers can assist in decision-making for auto-pilot systems.

“The results showed that when a combination of images (detected objects) and sensor data is fed into the LLM, it can offer precise information for brake and throttle control in autonomous vehicles across various weather conditions.” — Double check that.
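
Skepticism aside, the interface the paper describes is easy to sketch: serialize detected objects and sensor readings into a prompt, ask for brake/throttle values, and parse and clamp the reply. The JSON schema, sensor fields and `query_llm` stub below are assumptions, not the paper's actual interface:

```python
# Prompt-and-parse sketch for LLM-assisted brake/throttle suggestions.
# Field names, value ranges and `query_llm` are illustrative assumptions.
import json

def query_llm(prompt: str) -> str:
    # placeholder: a real setup would call the model under test
    return '{"throttle": 0.0, "brake": 0.8, "reason": "pedestrian ahead in fog"}'

scene = {
    "detected_objects": [{"type": "pedestrian", "distance_m": 12.0}],
    "sensors": {"speed_kmh": 35.0, "visibility": "fog", "road": "wet"},
}

prompt = (
    "You suggest brake and throttle values for an autonomous car.\n"
    f"Scene: {json.dumps(scene)}\n"
    'Reply as JSON: {"throttle": 0..1, "brake": 0..1, "reason": "..."}'
)

reply = json.loads(query_llm(prompt))
throttle = min(max(float(reply["throttle"]), 0.0), 1.0)   # clamp to valid range
brake = min(max(float(reply["brake"]), 0.0), 1.0)
print(f"throttle={throttle}, brake={brake}, reason={reply['reason']}")
```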

https://arxiv.org/pdf/2402.13028.pdf

Fact checking aims to predict claim veracity by reasoning over multiple evidence pieces. It usually involves evidence retrieval and veracity reasoning. In this paper, we focus on the latter, reasoning over unstructured text and structured table information. Previous works have primarily relied on finetuning pretrained language models or training homogeneous graph-based models. Despite their effectiveness, we argue that they fail to explore the rich semantic information underlying the evidence with different structures. To address this, we propose a novel word-level Heterogeneous-graph-based model for Fact Checking over unstructured and structured information, namely HeterFC. Our approach leverages a heterogeneous evidence graph, with words as nodes and thoughtfully designed edges representing different evidence properties. We perform information propagation via a relational graph neural network, facilitating interactions between claims and evidence. An attention-based method is utilized to integrate information, combined with a language model for generating predictions. We introduce a multitask loss function to account for potential inaccuracies in evidence retrieval. Comprehensive experiments on the large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC. Code will be released at: https://github.com/Deno-V/HeterFC.
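
A simplified sketch of the word-level heterogeneous graph idea: each word occurrence becomes a node tagged with its source (claim, evidence sentence, table cell), adjacent words within a source are linked, and matching words across claim and evidence get their own relation. The relation names and construction rules here are my assumptions, and the real model runs a relational GNN on top:

```python
# Word-level heterogeneous graph construction sketch (simplified assumption,
# not the exact HeterFC recipe): typed edges over word-occurrence nodes.
claim = "paris is the capital of france".split()
sentence = "france 's capital city is paris".split()
cell = "paris population 2.1 million".split()

sources = {"claim": claim, "sentence": sentence, "cell": cell}
nodes = [(src, i, w) for src, words in sources.items() for i, w in enumerate(words)]
node_id = {(src, i): k for k, (src, i, _) in enumerate(nodes)}

edges = []  # (src_node, dst_node, relation)
for src, words in sources.items():
    for i in range(len(words) - 1):          # adjacent words within one source
        edges.append((node_id[(src, i)], node_id[(src, i + 1)], f"{src}_internal"))

for i, w in enumerate(claim):                # claim word matches an evidence word
    for src in ("sentence", "cell"):
        for j, v in enumerate(sources[src]):
            if w == v:
                edges.append((node_id[("claim", i)], node_id[(src, j)], "claim_match"))

print(len(nodes), "nodes,", len(edges), "typed edges")
print(edges[:5])
```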

https://arxiv.org/pdf/2402.13058.pdf

Evidence theory is widely used in decision-making and reasoning systems. In previous research, the Transferable Belief Model (TBM) has been a commonly used evidential decision-making model, but TBM is a non-preference model. In order to better fit decision-making goals, the Evidence Pattern Reasoning Model (EPRM) is proposed. By defining pattern operators and decision-making operators, corresponding preferences can be set for different tasks. Random Permutation Set (RPS) expands order information for evidence theory, but it is hard for RPS to characterize complex relationships between samples, such as cyclic and parallel relationships. Therefore, Random Graph Set (RGS) is proposed to model complex relationships and represent more event types. To illustrate the significance of RGS and EPRM, an experiment on aircraft velocity ranking was designed and 10,000 cases were simulated. The implementation of EPRM, called Conflict Resolution Decision, improved 18.17% of the cases compared to Mean Velocity Decision, effectively improving the aircraft velocity ranking. EPRM provides a unified solution for evidence-based decision making.
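
For context, the classic operation underlying evidence-theoretic models like TBM is Dempster's rule of combination. A textbook sketch follows; the frame and the two mass functions are toy values, and EPRM/RGS build richer pattern and graph structures on top:

```python
# Dempster's rule of combination over mass functions with frozenset focal elements.
# Toy example: two sources assign belief mass over the frame {A, B, C}.
from itertools import product

def combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions, renormalizing away conflicting mass."""
    unnorm, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            unnorm[inter] = unnorm.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    return {s: v / (1.0 - conflict) for s, v in unnorm.items()}

A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
m1 = {A: 0.6, A | B: 0.3, A | B | C: 0.1}   # source 1's belief masses
m2 = {B: 0.5, A | B: 0.4, A | B | C: 0.1}   # source 2's belief masses

for focal, mass in sorted(combine(m1, m2).items(), key=lambda kv: -kv[1]):
    print(set(focal), round(mass, 3))
```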

Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
