LLM Hallucinations // are they fixable, and how?

LLMs generate text based on patterns learned from vast datasets // no surprise — errors, a.k.a. hallucinations

sbagency
10 min read · Jul 28, 2024

All models are wrong, but some are useful %)

LLM hallucinations

The most practical technique to reduce hallucinations is RAG (Retrieval-Augmented Generation). Providing context with precise information can boost the accuracy of LLMs and cut down on hallucinations, especially for models with a 128K+ context window. Fine-tuning has its own risks, such as catastrophic forgetting. And of course, to do anything about hallucinations we first need to detect them.
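
A rough, hedged sketch of the RAG idea (the toy retriever and prompt wording below are made up, not any particular framework's API): retrieve a few relevant passages and prepend them to the prompt, so the model answers from supplied context rather than from its parametric memory alone.

```python
# Minimal RAG sketch (illustrative only): retrieve the most relevant passages
# for a query and prepend them to the prompt. The corpus and retriever are
# toy stand-ins for a real vector store.

CORPUS = [
    "The Colosseum in Rome was completed in 80 AD under Emperor Titus.",
    "Roman aqueducts carried water into cities using gravity alone.",
    "Llama 3 is an open-weight LLM released by Meta in 2024.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy retriever: rank passages by word overlap with the query.
    q_words = set(query.lower().split())
    scored = sorted(CORPUS, key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(f"- {p}" for p in retrieve(query))
    return (
        "Answer using ONLY the context below; "
        "if it is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

print(build_rag_prompt("When was the Colosseum completed?"))
```

The prompt built here would then go to whatever LLM you use; the grounding instruction plus retrieved context is what cuts down on fabricated answers.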

Hallucination Detector // Oxford scientists propose effective method to detect AI hallucinations

https://www.deeplearning.ai/the-batch/oxford-scientists-propose-effective-method-to-detect-ai-hallucinations/
https://lilianweng.github.io/posts/2024-07-07-hallucination/

Here is a summary of the key points from the article on extrinsic hallucinations in large language models (LLMs):

1. Extrinsic hallucination refers to when an LLM generates content that is not grounded in or verified by external world knowledge.

2. Causes of hallucinations include issues with pre-training data and challenges with introducing new knowledge during fine-tuning.

3. Several methods exist for detecting hallucinations:
— Retrieval-augmented evaluation using external knowledge bases
— Sampling-based approaches that check for consistency across multiple model outputs (a toy sketch of this check follows the summary below)
— Calibration techniques to assess model uncertainty

4. Approaches to reduce hallucinations include:
— Retrieval-augmented generation (RAG) with editing and attribution
— Chain-of-verification techniques
— Special sampling methods during inference
— Fine-tuning specifically for factuality and attribution

5. Key benchmarks for evaluating factuality and hallucination in LLMs include TruthfulQA, FactualityPrompt, and SelfAware.

6. There’s an inherent tension between factuality and other desirable properties like helpfulness — techniques that improve factuality may negatively impact other aspects of model performance.

7. Many approaches find that larger models tend to hallucinate less, but this isn’t universally true across all benchmarks and techniques.

8. Providing attributions/citations in model outputs is a promising direction for improving factuality and reducing hallucinations.

The article provides an in-depth overview of the current research landscape around understanding, detecting, and mitigating hallucinations in large language models.
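
To make point 3's sampling-based idea concrete, here is a toy consistency check (not the Oxford method itself, just the general pattern): sample the same question several times and treat low agreement among the answers as a hallucination signal. The `sample_answers` stub stands in for repeated temperature > 0 calls to a real model.

```python
# Toy sampling-based hallucination check: low agreement across samples
# suggests the model is guessing rather than recalling a grounded fact.
from collections import Counter

def sample_answers(question: str, n: int = 5) -> list[str]:
    # Stand-in for n calls to an LLM at temperature > 0; mocked here.
    return ["80 AD", "80 AD", "72 AD", "80 AD", "80 AD"]

def consistency_score(answers: list[str]) -> float:
    # Fraction of samples agreeing with the most common answer.
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

answers = sample_answers("When was the Colosseum completed?")
score = consistency_score(answers)
print(f"agreement={score:.2f}",
      "-> likely hallucination" if score < 0.6 else "-> likely grounded")
```

Real detectors compare meanings rather than exact strings (e.g. by clustering semantically equivalent answers), but the signal is the same: disagreement across samples is suspicious.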

“..What memory tuning is doing is taking that to the extreme and doing it on the adapter side of things. So doing it on top of the model with LoRAs, and we tune these memory experts — we have a million memory experts or more. The idea is that instead of eight, now you have a million; you essentially get a heavily, sparsely activated model, so you can scale the model to be incredibly large..”

Here’s a summary of the key points from the conversation with Sharon Zhou, CEO of Lamini (by claude.ai):

1. Market overview:
— AI hype is slowing down, but enterprises are tackling deeper use cases beyond shallow applications like email composition
— Compute infrastructure and budgets are in place, but organizational structure remains a challenge
— It’s becoming a hybrid world with general-purpose models for basic tasks and custom models for proprietary differentiation

2. Lamini’s offerings:
— Integrated inference and fine-tuning platform for enterprises
— Enables running factual LLMs on proprietary data within secure environments
— Offers free inference (40 million tokens) to get customers started
— Works with various GPU types (NVIDIA, AMD) and supports multiple open-source models

3. Memory tuning:
— Lamini’s breakthrough technique to reduce hallucinations in LLMs
— Brings loss to zero on specific facts, making the model deterministic for crucial information
— Uses a mixture of experts approach with millions of memory experts
— Achieved 95% accuracy (up from 50%) for a Fortune 100 company in text-to-SQL task

4. Technical concepts explained:
— LoRA (Low-Rank Adaptation): Efficient fine-tuning technique that trains only small low-rank adapter matrices on top of frozen weights (see the sketch after this summary)
— Mixture of Experts: Approach that uses multiple specialized sub-models (experts) to handle different types of queries

5. AI agents:
— Sharon views them as object-oriented programming with LLMs as the core component
— Warns about the gap between current capabilities and expectations
— Suggests specialization and reducing the number of LLM calls for better performance

6. Future directions for Lamini:
— Reducing memory-tuning time from hours to minutes
— Working towards continuous fine-tuning of models
— Exploring new architectures like diffusion models for text generation

7. AI development perspective:
— Sharon leans towards improvements in compute utilization rather than adding symbolic reasoning to LLMs

The conversation highlights Lamini’s focus on improving LLM accuracy and efficiency for enterprise use cases, particularly through their memory tuning technique and optimized inference capabilities.
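
A minimal sketch of the LoRA idea mentioned in point 4, with made-up dimensions: the pretrained weight matrix W stays frozen, and only a low-rank update B·A is trained and added at inference time.

```python
# Minimal LoRA sketch (illustrative): train a low-rank update B @ A
# instead of updating the full weight matrix W.
import numpy as np

d, r = 8, 2                            # hidden size and (much smaller) adapter rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))            # frozen pretrained weights
A = rng.normal(size=(r, d)) * 0.01     # trainable low-rank factor
B = np.zeros((d, r))                   # initialised to zero, so the adapter starts as a no-op

def forward(x: np.ndarray) -> np.ndarray:
    # Base model output plus the low-rank adapter's contribution.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
print(forward(x).shape)                # (8,): same interface, far fewer trainable weights
```

Only A and B (2·d·r parameters) are trained instead of the full d² weights; Lamini's memory tuning stacks huge numbers of such adapters, one expert per slice of facts.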

Lamini Memory Tuning is a new way to embed facts into LLMs that improves factual accuracy and reduces hallucinations to previously unachievable levels — for one Fortune 500 customer, Lamini Memory Tuning led to 95% accuracy compared to 50% with other approaches. Hallucinations were reduced from 50% to 5%.

Lamini Memory Tuning is a research breakthrough that overcomes a seeming paradox in the AI world: achieving precise factual accuracy (i.e. no hallucinations) while upholding the generalization capabilities that make LLMs valuable in the first place.

The method entails tuning millions of expert adapters (e.g. LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral 3. If the goal is to get Roman Empire facts exactly right, Lamini Memory Tuning would create experts on Caesar, aqueducts, legions, and any other facts you provide. Inspired by information retrieval, the model retrieves only the most relevant experts from an index at inference time — not all the model weights — so latency and cost are dramatically lower. High accuracy, high speed, low cost: with Lamini Memory Tuning, you don’t have to choose.
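
A hedged sketch of the retrieval step described above, under the assumption that each memory expert is keyed by an embedding (names, dimensions, and the similarity function are illustrative, not Lamini's implementation): at inference time only the top-k experts nearest to the query embedding are looked up and applied.

```python
# Hedged sketch of expert retrieval from an index: each memory expert is
# keyed by an embedding, and only the top-k closest experts are activated.
# Lamini describes millions of experts; 1,000 keeps this toy small.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, N_EXPERTS = 16, 1000

expert_keys = rng.normal(size=(N_EXPERTS, EMB_DIM))     # one key per memory expert
expert_keys /= np.linalg.norm(expert_keys, axis=1, keepdims=True)

def top_k_experts(query_emb: np.ndarray, k: int = 4) -> list[int]:
    """Return indices of the k experts whose keys best match the query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = expert_keys @ q                             # cosine similarity
    return np.argsort(-scores)[:k].tolist()

query = rng.normal(size=EMB_DIM)                         # stand-in for an embedded prompt
active = top_k_experts(query)
print("activating experts:", active)                     # only these adapters get applied
```

Because only a handful of adapters are touched per query, the sparsity keeps inference latency and cost close to that of the base model.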

https://arxiv.org/pdf/2406.17642

Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold, as it usually is in practice when training on internet-scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first-generation model for removing hallucinations — Lamini-1 — that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.

https://arxiv.org/pdf/2402.16063v2

Large language models (LLMs) exhibit powerful general intelligence across diverse scenarios, including their integration into chatbots. However, a vital challenge of LLM-based chatbots is that they may produce hallucinated content in responses, which significantly limits their applicability. Various efforts have been made to alleviate hallucination, such as retrieval-augmented generation and reinforcement learning with human feedback, but most of them require additional training and data annotation. In this paper, we propose a novel post-hoc Citation-Enhanced Generation (CEG) approach combined with retrieval augmentation. Unlike previous studies that focus on preventing hallucinations during generation, our method addresses this issue in a post-hoc way. It incorporates a retrieval module to search for supporting documents relevant to the generated content, and employs a natural language inference-based citation generation module. When statements in the generated content lack a supporting reference, our model can regenerate responses until all statements are backed by citations. Note that our method is a training-free, plug-and-play plugin applicable to various LLMs. Experiments on three hallucination-related benchmarks show our framework outperforms state-of-the-art methods in both hallucination detection and response regeneration. Our code and dataset will be publicly available.
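
A hedged toy sketch of the post-hoc loop described in the abstract (not the authors' code): split the response into statements, try to retrieve a supporting document for each, judge support with an NLI-style check, and flag unsupported statements for regeneration. Retrieval and NLI are mocked with trivial stand-ins here.

```python
# Toy post-hoc citation check: statements with no supporting document are
# flagged; in CEG these would trigger regeneration of the response.

DOCS = [
    "The Colosseum was completed in 80 AD.",
    "Roman aqueducts used gravity to move water.",
]

def retrieve_support(statement: str) -> str | None:
    # Stand-in for a retrieval module (real systems use BM25 or dense search).
    for doc in DOCS:
        if any(w in doc.lower() for w in statement.lower().split() if len(w) > 4):
            return doc
    return None

def entails(doc: str, statement: str) -> bool:
    # Stand-in for an NLI model judging whether doc supports the statement.
    return all(w in doc.lower() for w in statement.lower().split() if len(w) > 4)

def check_response(statements: list[str]) -> list[tuple[str, str]]:
    cited, unsupported = [], []
    for s in statements:
        doc = retrieve_support(s)
        if doc and entails(doc, s):
            cited.append((s, doc))          # statement gets a citation
        else:
            unsupported.append(s)           # would be regenerated in CEG
    print("unsupported statements:", unsupported)
    return cited

check_response(["The Colosseum was completed in 80 AD.",
                "The Colosseum seats two million people."])
```

The appeal of this pattern is that it is training-free: it wraps any existing model instead of changing how the model generates in the first place.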

Structuring information

Unfortunately, ideas for structuring world information are not a magic pill, but some techniques are applicable // knowledge graphs, ontologies, etc.
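
One applicable technique is checking generated claims against a structured source such as a knowledge graph of (subject, relation, object) triples. The toy sketch below assumes claims have already been parsed into triples; a real system would query an ontology store (RDF/SPARQL or similar).

```python
# Toy grounding check against a tiny in-memory knowledge graph of triples.
# Illustrative only; the graph contents and relation names are made up.

KNOWLEDGE_GRAPH = {
    ("rome", "capital_of", "italy"),
    ("julius_caesar", "member_of", "first_triumvirate"),
}

def is_grounded(subject: str, relation: str, obj: str) -> bool:
    return (subject.lower(), relation.lower(), obj.lower()) in KNOWLEDGE_GRAPH

print(is_grounded("Rome", "capital_of", "Italy"))    # True
print(is_grounded("Rome", "capital_of", "France"))   # False -> flag for review
```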

https://youtu.be/t7wZbbISdyA?t=76
https://youtu.be/t7wZbbISdyA?t=112
https://youtu.be/t7wZbbISdyA?t=134
https://youtu.be/t7wZbbISdyA?t=196

Abduction == guessing based on many facts, deduction == necessary reasoning/idealization, induction == experimental research

Knowledge soup

Here’s a summary of the key points from the presentation (claude.ai transcript summary):

1. Large language models (LLMs) are insufficient for reasoning without an ontology. The speaker argues that precise control and natural language support are necessary for practical AI applications.

2. Human languages are derived from experiences and perceptions, but they are not the language of thought. There’s much more to thinking than just language, including spatial reasoning and mental models.

3. Word meanings are context-dependent and open-ended. Even in mathematics and science, precise definitions require stated axioms.

4. Visual and spatial understanding is crucial for intelligence. The speaker argues that even bird brains can perform tasks that computers cannot, such as building nests in irregular environments.

5. Sign languages demonstrate that language processing is not dependent on spoken words. They utilize 3D space and can convey complex ideas without speech.

6. In mathematics and science, visualization and mental models often precede symbolic representation. Einstein and other scientists have emphasized the importance of visual thinking.

7. The speaker advocates for a “neurocognitive” approach to AI, incorporating various branches of cognitive science, rather than just focusing on artificial neural networks and symbols.

8. The speaker presents a model for AI systems with a “central executive” that integrates various reasoning methods, including LLMs, but also incorporates other AI techniques and ethical considerations.

9. The speaker’s company is developing AI systems using this broader, neurocognitive approach, which includes LLMs but is not limited to them.

10. The presentation concludes by suggesting that future AI systems should aim to play a Socratic role, helping to draw out ideas and understanding from humans and other AI systems.

https://arxiv.org/pdf/2407.16908

Addressing the issue of hallucinations in large language models (LLMs) is a critical challenge. As the cognitive mechanisms of hallucination have been related to memory, here we explore hallucination for an LLM that is enabled with explicit memory mechanisms. We empirically demonstrate that by simply scaling the readout vector that constrains generation in a memory-augmented LLM decoder, hallucination mitigation can be achieved in a training-free manner. Our method is geometry-inspired and outperforms a state-of-the-art LLM editing method on the task of generating Wikipedia-like biography entries, both in terms of generation quality and runtime complexity.

Here’s a summary of the key points from the paper:

1. The paper discusses a method to mitigate hallucinations in large language models (LLMs) by scaling the readout vector that constrains generation in a memory-augmented LLM decoder.

2. The authors use Larimar, a memory-augmented LLM, and compare its performance on a hallucination benchmark with GRACE, an existing model editing technique.

3. The WikiBio dataset, containing Wikipedia-like biographies generated by GPT-3, is used as a hallucination benchmark.

4. The authors observe that by scaling up the length of the readout vector in Larimar, they can minimize its distance to the write vector and potentially reduce hallucinations.

5. Experiments show that scaling the readout vector by a factor of 3–4 leads to significant improvements in ROUGE-L and Jaccard similarity scores compared to both the base Larimar model and GRACE (see the toy sketch after this list).

6. The proposed method is training-free and computationally efficient compared to GRACE, which requires learning codebooks through iterative optimization.

7. The paper concludes that simple geometry-inspired operations on memory readouts in memory-augmented models like Larimar can be more effective at mitigating hallucinations than training-based approaches like GRACE.

8. The study demonstrates the potential of leveraging explicit memory mechanisms in LLMs to address the hallucination problem.
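
As a purely geometric toy illustration of points 4 and 5 (made-up vectors, not Larimar's code): when the readout vector points roughly along the write vector but is too short, multiplying it by a scalar shrinks its distance to the write vector.

```python
# Toy geometric illustration of readout-vector scaling: the distance to the
# stand-in "write" vector drops as the readout is scaled up.
import numpy as np

write = np.array([3.0, 4.0, 0.0])      # stand-in write vector stored in memory
readout = np.array([0.9, 1.1, 0.1])    # stand-in readout, similar direction but shorter

def distance(scale: float) -> float:
    return float(np.linalg.norm(scale * readout - write))

for s in (1.0, 2.0, 3.0, 4.0):
    print(f"scale={s}: distance to write vector = {distance(s):.2f}")
# The distance bottoms out around a 3-4x scale here, which mirrors the
# geometric intuition behind the paper's reported scaling factor.
```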


sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.