LLM reasoning failures, again
A new study shows that LLMs are surprisingly brittle to the ordering of premises, though this is not actually much of a surprise.
From the paper's abstract: Large language models (LLMs) have achieved remarkable reasoning performance in various domains. However, in the domain of reasoning, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the same order as the ground-truth proof in the prompt (as opposed to a random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning across a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy relative to the original GSM8K benchmark.
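To make the setup concrete, here is a minimal sketch of this kind of permutation test; the toy problem and the `query_llm` stub are illustrative placeholders of mine, not the paper's actual code or data.

```python
import random

# Illustrative deductive problem; premises are listed in ground-truth proof order.
premises = [
    "If Alice is a duckling, then Alice is a duck.",
    "If Alice is a duck, then Alice is a bird.",
    "If Alice is a bird, then Alice is an animal.",
]
question = "Alice is a duckling. Is Alice an animal?"

def build_prompt(ordered_premises):
    """Assemble one prompt: the premises in the given order, then the fixed question."""
    return "\n".join(ordered_premises) + "\n" + question

# The forward prompt follows the proof order; the permuted prompt states
# exactly the same facts, so the underlying task is unchanged.
forward_prompt = build_prompt(premises)
permuted_prompt = build_prompt(random.sample(premises, k=len(premises)))

# In the actual experiment, both prompts would go to the model under test
# and accuracy would be compared across many such problems.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for the model API under evaluation")
```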
Two headline findings from the deductive-reasoning experiments:
1. Presenting “If A then B” before “If B then C” in the prompt generally yields higher accuracy than the reversed order.
2. The performance gap widens as the number of premises increases (see the sketch below).
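To see the two orderings side by side, here is a small sketch (the propositions P0, P1, ... are illustrative) that generates an n-step implication chain in forward or reversed order; per the finding above, the accuracy gap between the two resulting prompts should widen as n grows.

```python
def implication_chain(n: int, forward: bool = True) -> list[str]:
    """Build n chained implications 'If P_i then P_{i+1}'.

    forward=True lists them in proof order; forward=False fully reverses them.
    """
    rules = [f"If P{i} then P{i + 1}." for i in range(n)]
    return rules if forward else rules[::-1]

print(implication_chain(3))
# ['If P0 then P1.', 'If P1 then P2.', 'If P2 then P3.']
print(implication_chain(3, forward=False))
# ['If P2 then P3.', 'If P1 then P2.', 'If P0 then P1.']
```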
The main points of the paper, in summary:
1. Large language models (LLMs) show impressive reasoning performance on various tasks, but they exhibit a surprising brittleness to the ordering of premises, despite the ordering not changing the underlying reasoning task.
2. The authors evaluate the effect of premise ordering on LLMs’ performance for deductive logical reasoning and mathematical word problems. They find that permuting the premise order can cause a drop of over 30% in accuracy compared to presenting premises in the order that aligns with the reasoning steps.
3. LLMs perform best when the premises are arranged in the forward order, i.e., the order in which they are used in the ground-truth proof. They also generally handle the backward order (the exact reverse of forward) better than random permutations (quantified in the sketch after this list).
4. The premise ordering effect is amplified when irrelevant premises are included, as LLMs struggle with premise selection.
5. The authors release R-GSM, a benchmark for evaluating the ordering effect on mathematical reasoning by reordering premises in GSM8K problems.
6. The findings highlight that despite premise order being immaterial for the reasoning task itself, it significantly impacts LLM performance, suggesting they are better at reasoning via sequential left-to-right processing rather than referencing across premises.
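One convenient way to place a prompt's premise ordering on the forward-to-backward spectrum is rank correlation against the proof order. The sketch below uses Kendall's tau from SciPy; the metric and the helper `order_score` are my illustration, not necessarily the paper's exact definition.

```python
from scipy.stats import kendalltau

def order_score(prompt_order: list[str], proof_order: list[str]) -> float:
    """Kendall's tau between the proof order and where each premise sits in the prompt.

    +1.0 means forward (prompt matches the proof), -1.0 means fully backward.
    """
    ranks = [prompt_order.index(p) for p in proof_order]
    tau, _ = kendalltau(range(len(ranks)), ranks)
    return tau

proof = ["p1", "p2", "p3", "p4"]
print(order_score(["p1", "p2", "p3", "p4"], proof))  #  1.0 (forward)
print(order_score(["p4", "p3", "p2", "p1"], proof))  # -1.0 (backward)
print(order_score(["p2", "p4", "p1", "p3"], proof))  #  0.0 (in between)
```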
The paper thus pins down a key brittleness of large language models: a surprising sensitivity to premise ordering in reasoning tasks, even when the order is logically irrelevant.
LLMs do not really do logical reasoning; they follow pre-learned templates, and this can be exposed with many simple tests. This paper shows it once again.