Unfortunately, too few people understand the distinction between memorization and understanding. It’s not some lofty question like “does the system have an internal world model?”; it’s a very pragmatic behavioral distinction: “is the system capable of broad generalization, or is it limited to local generalization?”
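To make that behavioral distinction concrete, here’s a toy sketch in Python. The string-reversal task and the pure-memorizer baseline are illustrative stand-ins of my own choosing, not a claim about how any particular LLM works internally.

```python
# Toy illustration of local vs. broad generalization on a trivial task
# (string reversal). A pure memorizer only answers inputs it has seen;
# a system that encodes the rule handles inputs of any length.
import random

def make_example(length):
    s = "".join(random.choice("0123456789") for _ in range(length))
    return s, s[::-1]  # (input, target): reverse the digit string

# "Training data": lots of short examples (lengths 1-5).
train = dict(make_example(random.randint(1, 5)) for _ in range(50_000))

def memorizer(x):
    return train.get(x)   # local generalization at best: lookup of seen inputs

def rule_based(x):
    return x[::-1]        # broad generalization: the rule works at any length

in_dist = make_example(3)[0]    # short input, almost certainly in the training set
far_out = make_example(40)[0]   # long input, impossible to have memorized

print(memorizer(in_dist), rule_based(in_dist))   # memorizer usually gets this one
print(memorizer(far_out), rule_based(far_out))   # memorizer returns None; the rule still works
```

The point is purely behavioral: you don’t need to inspect internals to tell the two apart, you just need to test on inputs far from the training distribution.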
LLMs have failed every single benchmark and experiment focused on generalization since their inception. It’s not just ARC: this is documented in literally hundreds, possibly thousands, of papers. The ability of LLMs to solve a task is entirely dependent on their familiarity with the task (local generalization).
As a result, the only avenue available to increase LLM performance on new tasks and situations is to train them on more data: millions of times more data than is available to a human. But no matter how much data you train on, there will always be never-before-seen tasks and situations where LLMs will stumble.
These arguments are incredibly tired. If you didn’t get it in 2017, you’re not going to get it now.
Ok, by popular demand: a starter set of papers you can read on the topic.
“Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks”: https://arxiv.org/abs/2311.09247
“Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve”: https://arxiv.org/abs/2309.13638
“Faith and Fate: Limits of Transformers on Compositionality”: https://arxiv.org/abs/2305.18654
“The Reversal Curse: LLMs trained on ‘A is B’ fail to learn ‘B is A’”: https://arxiv.org/abs/2309.12288
“On the Measure of Intelligence”: https://arxiv.org/abs/1911.01547. Not about LLMs, but it provides context and grounding on what it means to be intelligent and on the nature of generalization. It also introduces an intelligence benchmark (ARC) that remains completely out of reach for LLMs. Ironically, the best-performing LLM-based systems on ARC are those that have been trained on tons of generated tasks, in the hope of hitting some overlap between the test-set tasks and the generated tasks; LLMs have zero ability to tackle an actually new task. (See the sketch of the ARC task format just below.)
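For concreteness, here’s roughly what an ARC task looks like on disk: the public tasks in the fchollet/ARC repository are JSON files with a few demonstration pairs and one or more held-out test pairs, where grids are lists of lists of integers 0-9. The specific grids and the `solve` function below are invented for illustration; a real solver has to infer the transformation from the demonstration pairs alone.

```python
# Minimal sketch of the ARC task format (the real tasks are JSON files with
# this shape; the grids below are made up for illustration). Each task gives
# a few demonstration pairs; the solver must infer the transformation and
# apply it to the test input(s). Grid cells are integers 0-9 (colors).
import json

task = json.loads("""
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
""")

def solve(train_pairs, test_input):
    # Placeholder solver: a real one must infer the rule from train_pairs alone.
    # For this toy task the rule happens to be "mirror each row"; it is
    # hard-coded here only so the sketch runs end to end.
    return [list(reversed(row)) for row in test_input]

prediction = solve(task["train"], task["test"][0]["input"])
print(prediction == task["test"][0]["output"])  # True for this toy task
```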
In general, a new paper documenting the lack of broad generalization capabilities in LLMs comes out every few days.