Artificial intelligence and beyond // can modern AI models reason and plan?

The ability to understand the physical world, persistent memory (the ability to remember and retrieve things), the ability to reason, and the ability to plan are four essential characteristics of intelligent systems.

sbagency
4 min read · Mar 8, 2024

There is another challenge: formalizing what it actually means to understand the world, to remember and retrieve, to reason, and to plan.

https://www.youtube.com/watch?v=5t1vTLU7s40

The key points discussed in this conversation are:

Yann LeCun is highly skeptical about the ability of current autoregressive large language models (LLMs) like GPT to achieve human-level or superhuman intelligence, despite their impressive language abilities. He argues that LLMs lack essential capabilities like understanding the physical world, persistent memory, reasoning, and planning.

LeCun advocates for an alternative approach called Joint Embedding Predictive Architecture (JEPA), which aims to learn abstract representations of the world from sensory data like images and videos, without trying to reconstruct every pixel detail. Methods like DINO, I-JEPA, and V-JEPA have shown promising results in learning good representations that capture intuitive physics and common sense.
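To make the joint-embedding idea concrete, here is a minimal PyTorch-style sketch (an illustration only, not the actual I-JEPA/V-JEPA code): two views of the same input are encoded, a predictor maps the context embedding to a guess of the target embedding, and the loss is computed in representation space rather than pixel space. All module shapes are toy assumptions.

```python
# Minimal sketch of a joint-embedding predictive step (illustrative, not I-JEPA/V-JEPA):
# encode two views, predict the target embedding from the context embedding, and
# measure the error in embedding space instead of reconstructing pixels.
import torch
import torch.nn as nn

embed_dim = 128
context_encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, embed_dim))
target_encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, embed_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                          nn.Linear(embed_dim, embed_dim))

def jepa_step(context_view, target_view):
    """One training step: predict the target's embedding from the context's embedding."""
    s_context = context_encoder(context_view)        # abstract representation of the visible part
    with torch.no_grad():                            # target branch is not updated by this loss
        s_target = target_encoder(target_view)       # abstract representation of the masked/future part
    s_pred = predictor(s_context)                    # prediction made purely in embedding space
    return nn.functional.mse_loss(s_pred, s_target)  # no pixel reconstruction anywhere

# Usage with random stand-in "images" (batch of 8 single-channel 32x32 crops):
loss = jepa_step(torch.randn(8, 1, 32, 32), torch.randn(8, 1, 32, 32))
loss.backward()
```

In practice, methods like DINO and I-JEPA also rely on tricks such as a stop-gradient or an exponential-moving-average target encoder to prevent the representations from collapsing to a trivial constant.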

LeCun believes that JEPA-like models, combined with techniques for hierarchical planning and model-predictive control, could pave the way towards systems with human-level intelligence that can understand and interact with the physical world.
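As a rough illustration of how such a learned world model could be used for planning, the sketch below implements the simplest form of model-predictive control (random shooting): candidate action sequences are rolled forward through the model, scored against a goal, and only the first action of the best rollout is executed before replanning. The `world_model` and `cost` functions are invented stand-ins, not anything from LeCun's published systems.

```python
# Random-shooting model-predictive control over a placeholder learned dynamics model.
import numpy as np

def world_model(state, action):
    # Placeholder dynamics; a JEPA-like model would predict the next abstract state.
    return state + 0.1 * action

def cost(state, goal):
    return np.sum((state - goal) ** 2)

def mpc_plan(state, goal, horizon=10, n_candidates=256, action_dim=2):
    """Pick the first action of the cheapest imagined rollout, then replan next step."""
    best_action, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state.copy(), 0.0
        for a in actions:
            s = world_model(s, a)   # imagine the consequence of the action
            total += cost(s, goal)  # accumulate predicted cost along the rollout
        if total < best_cost:
            best_cost, best_action = total, actions[0]
    return best_action              # execute only the first step, then replan

action = mpc_plan(np.zeros(2), np.array([1.0, 1.0]))
```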

He acknowledges the impressive capabilities of LLMs enabled by self-supervised learning on vast textual data but argues that language alone is insufficient to capture the full richness of human knowledge and understanding acquired through sensory experience, especially in early childhood.

LeCun discusses the issue of hallucinations in LLMs, where nonsensical outputs can occur due to the exponential drift from the training distribution as the generated text becomes longer. Fine-tuning can mitigate this for common prompts but cannot cover the long tail of possible prompts.
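The drift argument can be illustrated with a back-of-the-envelope calculation: if each generated token independently has some small probability e of stepping outside the set of reasonable continuations, the chance that an n-token answer stays entirely on track decays as (1 − e)^n. The per-token error rates below are invented purely for illustration.

```python
# Toy model of exponential drift: probability that a length-n generation never
# leaves the "reasonable" set, assuming independent per-token error rate e.
for e in (0.001, 0.01, 0.05):
    for n in (10, 100, 1000):
        print(f"per-token error {e:.3f}, length {n:5d}: "
              f"P(no drift) = {(1 - e) ** n:.4f}")
```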

LeCun advocates moving beyond pure language models towards architectures that can effectively learn representations of the physical world, as a necessary step towards artificial general intelligence.

LeCun argues that current large language models only exhibit very primitive reasoning capabilities, limited by the fixed compute budget. He advocates for new architectures like energy-based models that can devote more compute to harder problems through optimization processes over abstract representations. However, he is skeptical that AGI is imminent, expecting it to arrive gradually over many years as various reasoning capabilities are developed.
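A minimal sketch of that idea (illustrative only, not LeCun's actual proposal): rather than producing an answer in a single fixed-cost forward pass, the system searches for a latent representation z that minimizes an energy E(x, z), so harder inputs can simply be given more optimization steps.

```python
# Energy-based inference as optimization: spend more compute by taking more steps.
import torch

def energy(x, z):
    # Toy energy: how incompatible the candidate "answer" z is with the input x.
    return torch.sum((z - torch.sin(x)) ** 2)

def infer(x, steps=100, lr=0.1):
    """Gradient-based inference: refine z for a chosen compute budget."""
    z = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):          # more steps = more compute = better answer on hard inputs
        optimizer.zero_grad()
        energy(x, z).backward()
        optimizer.step()
    return z.detach()

x = torch.linspace(0.0, 3.0, 8)
print(infer(x, steps=10))    # cheap, rough answer
print(infer(x, steps=500))   # more compute, closer to the energy minimum
```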

LeCun pushes back against AI doomers, arguing advanced AI will emerge iteratively allowing for corrections, and that such systems can be imbued with appropriate guardrails through their training objectives. He sees open-sourcing foundation models as vital for enabling a diversity of AI assistants representing different cultures, values and use cases.

Overall, LeCun expresses hope that smarter AI assistants will augment and empower humans, much like the transformative effect of the printing press, if the technology remains open and democratized rather than centralized.

https://arxiv.org/pdf/2403.04121.pdf

The paper discusses the reasoning and planning capabilities of large language models (LLMs) like GPT-3 and GPT-4. The author argues that despite claims made in various papers, LLMs do not actually possess true reasoning or planning abilities as conventionally understood. Instead, LLMs excel at approximate retrieval from their vast training data, which can sometimes give the illusion of reasoning.

The author’s research group found that obfuscating object and action names in planning problems caused a drastic drop in GPT-4’s performance, suggesting it relies heavily on approximate retrieval rather than actual reasoning. The author is skeptical of claims that LLMs can self-improve through iterative prompting, attributing apparent successes to the human guide inadvertently steering the LLM (the “Clever Hans” effect).
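The obfuscation manipulation is easy to picture: rename the objects and actions so the problem's semantics are unchanged but the surface form no longer resembles anything in the training data. The mapping below is invented for illustration; the actual experiments used standard planning benchmarks such as Blocksworld.

```python
# Obfuscating object/action names in a planning problem: the logical structure is
# identical, so a genuine reasoner should be unaffected, while approximate
# retrieval from familiar text breaks down.
obfuscation = {
    "unstack": "action_b",   # longer names first to avoid partial replacements
    "stack": "action_a",
    "block": "object_7",
    "table": "location_1",
}

def obfuscate(problem_text: str) -> str:
    """Replace familiar object/action names with meaningless tokens."""
    for name, alias in obfuscation.items():
        problem_text = problem_text.replace(name, alias)
    return problem_text

original = "unstack block B from block A, then stack block A on the table"
print(obfuscate(original))
# -> "action_b object_7 B from object_7 A, then action_a object_7 A on the location_1"
```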

While questioning LLMs’ autonomous reasoning abilities, the author acknowledges their strength in idea generation and extracting relevant knowledge. He proposes leveraging this capability through “LLM-Modulo” frameworks, where LLMs propose candidate solutions that are then verified by external model-based solvers or human experts. Overall, the author argues that LLMs should be viewed as powerful approximate retrieval systems rather than being ascribed true reasoning capabilities.
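A rough sketch of such a generate-and-test loop is shown below. The `propose_plan` and `verify_plan` functions are hypothetical stand-ins: in the LLM-Modulo framing the proposer would be an LLM and the verifier an external sound critic (a model-based solver, plan validator, or human expert), neither of which is implemented here.

```python
# LLM-Modulo style loop: the LLM proposes candidate plans, an external verifier
# accepts or rejects them and feeds critique back to the proposer.
def propose_plan(task, feedback=None):
    # Stand-in for an LLM call that drafts a candidate plan, optionally
    # conditioned on critique from the previous failed attempt.
    return ["pick up A", "place A on B"]

def verify_plan(task, plan):
    # Stand-in for a sound external verifier (e.g., a plan validator or a human expert).
    ok = len(plan) > 0
    return ok, "" if ok else "empty plan"

def llm_modulo(task, max_rounds=5):
    """Generate-and-test: the LLM proposes, the external verifier decides."""
    feedback = None
    for _ in range(max_rounds):
        plan = propose_plan(task, feedback)
        ok, feedback = verify_plan(task, plan)
        if ok:
            return plan   # correctness is guaranteed by the verifier, not the LLM
    return None           # no verified plan within the round budget

print(llm_modulo("stack block A on block B"))
```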

On this view, there is no reasoning in LLMs, only generation based on semantically close examples. A wrong or missing relevant template leads to so-called hallucinations, which in fact reflect an inability to generate new knowledge. One hallucination benchmark addressing this is described below.

https://arxiv.org/pdf/2403.03558v1.pdf

Large language models (LLMs) are highly effective at various natural language processing (NLP) tasks. However, they are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon known as hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWP). To support this approach, the authors develop a dataset called Unanswerable Math Word Problem (UMWP), comprising 5,200 questions across five categories, along with an evaluation methodology that combines text similarity and mathematical expression detection to determine whether the LLM treats a question as unanswerable. Extensive experiments on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, show that in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance a model’s ability to avoid hallucination. The authors conclude that unanswerable MWPs are a reliable and effective way to assess hallucination. Code and data are available at https://github.com/Yuki-Asuuna/UMWP.
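A simplified version of that kind of check might look like the sketch below, which combines textual cues for refusal with crude detection of a numeric answer. The phrase list and regular expression are illustrative assumptions, not the benchmark's actual implementation (see the linked repository for that).

```python
# Rough sketch of deciding whether a model's answer treats a problem as unanswerable,
# combining textual cues with simple mathematical-expression detection.
import re

UNANSWERABLE_CUES = (
    "cannot be determined",
    "not enough information",
    "unanswerable",
    "missing information",
)

def judged_unanswerable(model_output: str) -> bool:
    """Return True if the model appears to flag the problem as unanswerable."""
    text = model_output.lower()
    says_unanswerable = any(cue in text for cue in UNANSWERABLE_CUES)
    gives_number = re.search(r"-?\d+(\.\d+)?", text) is not None  # crude numeric-answer detection
    return says_unanswerable and not gives_number

print(judged_unanswerable("There is not enough information to solve this."))  # True
print(judged_unanswerable("The answer is 42."))                               # False
```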
