Reasoning // humans & LLMs

sbagency
3 min read · Nov 4, 2023


https://arxiv.org/pdf/2311.00445.pdf

A central component of rational behavior is logical inference: the process of determining which conclusions follow from a set of premises. Psychologists have documented several ways in which humans’ inferences deviate from the rules of logic. Do language models, which are trained on text generated by humans, replicate these biases, or are they able to overcome them? Focusing on the case of syllogisms — inferences from two simple premises, which have been studied extensively in psychology — we show that larger models are more logical than smaller ones, and also more logical than humans. At the same time, even the largest models make systematic errors, some of which mirror human reasoning biases such as ordering effects and logical fallacies. Overall, we find that language models mimic the human biases included in their training data, but are able to overcome them in some cases.

Here is a summary of the key points from the paper:

- The paper compares syllogistic reasoning in humans and language models (LMs). Syllogisms are logical arguments that draw a conclusion from two premises relating three terms (e.g., artists, bakers, chemists).

- The authors evaluated the PaLM 2 family of LMs on all 64 possible syllogisms (a prompt-construction sketch follows this summary). Larger LMs were more accurate than smaller LMs and humans, but even the largest model was only about 80% accurate.

- LMs struggled with some of the same syllogisms that are difficult for humans. However, LMs could overcome human biases and solve some syllogisms much better than humans.

- Like humans, LMs showed ordering effects — their responses were influenced by the order of terms in the premises, even though order is logically irrelevant.

- LMs displayed some of the same systematic fallacies as humans, where they confidently give incorrect answers. Larger LMs were more susceptible to these fallacies.

- Using the Mental Models framework from cognitive science, the authors showed that larger LMs exhibit signatures of more deliberative reasoning, similar to humans.

Overall, LMs replicate many human biases in syllogistic reasoning, likely because they are trained on human-generated text. However, they can overcome some of these biases and are more logical than humans on average, though still far from perfectly rational. The findings suggest that models acquire a mix of accurate and biased reasoning strategies.

This detailed analysis of a logical reasoning task reveals that larger neural network LMs improve in logical accuracy compared to humans but retain some human-like biases.
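
To make the experimental setup concrete, here is a minimal sketch of how the 64 syllogisms can be enumerated (4 moods for each premise x 4 figures) and turned into prompts, including the premise-order swap used to probe ordering effects. The prompt wording and the `query_model` call are illustrative assumptions, not the paper's exact protocol.

```python
from itertools import product

# The four syllogistic "moods" (quantifier forms).
MOODS = {
    "A": "All {X} are {Y}",        # universal affirmative
    "E": "No {X} are {Y}",         # universal negative
    "I": "Some {X} are {Y}",       # particular affirmative
    "O": "Some {X} are not {Y}",   # particular negative
}

# The four "figures": how the middle term B is arranged across the two premises.
# Each entry gives (terms of premise 1, terms of premise 2).
FIGURES = {
    1: (("A", "B"), ("B", "C")),
    2: (("B", "A"), ("C", "B")),
    3: (("A", "B"), ("C", "B")),
    4: (("B", "A"), ("B", "C")),
}

TERMS = {"A": "artists", "B": "bakers", "C": "chemists"}  # example terms from the paper

def premise(mood: str, subj: str, pred: str) -> str:
    return MOODS[mood].format(X=TERMS[subj], Y=TERMS[pred])

def all_syllogisms():
    """Yield the 64 premise pairs: 4 moods x 4 moods x 4 figures."""
    for m1, m2, fig in product(MOODS, MOODS, FIGURES):
        (s1, p1), (s2, p2) = FIGURES[fig]
        yield f"{m1}{m2}{fig}", premise(m1, s1, p1), premise(m2, s2, p2)

def build_prompt(p1: str, p2: str, swap: bool = False) -> str:
    """Swapping the premises is logically irrelevant, so differences in the
    model's answers between the two orders indicate an ordering effect."""
    first, second = (p2, p1) if swap else (p1, p2)
    return (f"{first}. {second}. "
            "What, if anything, follows about artists and chemists? "
            "If nothing follows, say 'nothing follows'.")

if __name__ == "__main__":
    for name, p1, p2 in all_syllogisms():
        for swap in (False, True):
            prompt = build_prompt(p1, p2, swap=swap)
            # response = query_model(prompt)  # placeholder: call your LM API here
            print(name, "swapped" if swap else "original", "->", prompt)
```

Comparing each model's answers across the original and swapped premise orders, and against the logically valid conclusions, is the kind of analysis that surfaces the ordering effects and fallacies described above.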

https://arxiv.org/pdf/2310.00741.pdf

Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g. information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
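
As a rough illustration of what segment-level factuality evaluation involves, the sketch below defines a FELM-like annotated example (segments carrying factuality labels, error types, and reference links) and scores an evaluator's per-segment predictions against gold labels. The field names and the scoring choice (treating "non-factual" as the positive class) are assumptions for illustration, not FELM's actual schema or official metric.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Segment:
    text: str
    is_factual: bool                  # gold, segment-level label
    error_type: Optional[str] = None  # e.g. "knowledge error", "reasoning error"
    references: List[str] = field(default_factory=list)  # supporting/contradicting links

@dataclass
class Example:
    domain: str                       # e.g. "world knowledge", "math", "reasoning"
    prompt: str
    segments: List[Segment]

def segment_level_scores(gold: List[bool], pred: List[bool]):
    """Score an evaluator's per-segment factuality predictions against gold labels,
    treating 'non-factual' (False) as the positive class, i.e. the errors to catch."""
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Usage: gold labels from the benchmark vs. labels predicted by an LLM-based evaluator.
gold = [True, False, True, False]
pred = [True, False, True, True]      # the evaluator missed one factual error
print(segment_level_scores(gold, pred))  # -> (1.0, 0.5, 0.666...)
```

Segment-level scoring like this is what makes it possible to say an evaluator "pinpoints" errors rather than merely judging whole responses, which is the benchmark's central design choice.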


Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
