Advanced RAG techniques // for better accuracy

Retrieval-augmented generation (RAG) is a core function of modern AI pipelines.

sbagency
7 min read · Jun 23, 2024
https://arxiv.org/pdf/2406.12430

In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, d_best, for a decision-making question Q, business rules R and a database D. Since there is no benchmark that can examine Decision QA, we propose a Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.

This paper introduces a new task called Decision QA, which aims to use large language models (LLMs) for complex decision-making that requires data analysis. The key points are:

1. Decision QA task: Given a decision-making question, business rules, and a database, the goal is to determine the best decision.

2. DQA benchmark: The authors created a benchmark called DQA with two scenarios (Locating and Building) based on video games that simulate business situations. It contains 301 question-database pairs.

3. PlanRAG technique: They propose a new Retrieval-Augmented Generation (RAG) technique called PlanRAG, which extends iterative RAG by adding planning and re-planning steps.

4. Methodology: PlanRAG involves three main steps: planning, retrieving & answering, and re-planning. It aims to make more effective decisions by first creating a plan for data analysis (see the code sketch after this summary).

5. Experiments: PlanRAG outperformed existing RAG techniques, improving accuracy by 15.8% for the Locating scenario and 7.4% for the Building scenario compared to iterative RAG.

6. Analysis: The paper includes detailed analysis of the results, including performance on different question types, database types, and error categories.

7. Limitations: The authors acknowledge limitations such as focusing only on graph and relational databases, and not exploring low-level methods for solving Decision QA.

8. Ethical considerations: The paper discusses potential biases in the data and steps taken to mitigate them, as well as licensing considerations for the video game data used.

The research aims to advance the use of LLMs in complex decision-making scenarios that require data analysis and planning.
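To make the loop concrete, here is a minimal Python sketch of the plan / retrieve & answer / re-plan cycle described above. It is not the authors' implementation: `llm` is a placeholder for any chat-model call and `run_query` for executing a query against the graph or relational database.

```python
# Minimal sketch of an iterative plan-then-retrieval loop (PlanRAG-style).
# `llm` and `run_query` are hypothetical placeholders supplied by the caller.

def plan_rag(question, rules, run_query, llm, max_steps=5):
    # Step 1: planning — ask the model for a data-analysis plan.
    plan = llm(f"Question: {question}\nRules: {rules}\n"
               "Write a step-by-step data-analysis plan.")
    observations = []
    for _ in range(max_steps):
        # Step 2: retrieving & answering — generate the next query and run it.
        query = llm(f"Plan: {plan}\nObservations so far: {observations}\n"
                    "Write the next database query, or ANSWER if done.")
        if query.strip().startswith("ANSWER"):
            break
        observations.append(run_query(query))
        # Step 3: re-planning — revise the plan if the new result demands it.
        decision = llm(f"Plan: {plan}\nNew observation: {observations[-1]}\n"
                       "Reply REPLAN or CONTINUE.")
        if decision.strip().startswith("REPLAN"):
            plan = llm(f"Question: {question}\nObservations: {observations}\n"
                       "Revise the data-analysis plan.")
    # Final decision from the accumulated evidence.
    return llm(f"Question: {question}\nRules: {rules}\n"
               f"Evidence: {observations}\nState the best decision.")
```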

https://x.com/omarsar0/status/1803262374574448757
https://arxiv.org/pdf/2406.12824

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical applications of language models in search, question answering, and chatbots. However, the exact nature of how this approach works isn’t clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a “shortcut” and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last token’s residual stream does not get enriched from the subject token in the question, but from other informative tokens in the context. We find this pronounced “shortcut” behaviour holds across both the LLaMa and Phi families of models.
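As a rough illustration of the kind of probe the paper describes, the hedged sketch below compares how much raw attention the final token pays to the retrieved context versus the question tokens. It uses attention weights only and GPT-2 as a stand-in model, a simplification of the paper's attention-contribution and knockout analyses.

```python
# Hedged sketch: how much attention mass does the last token place on the
# retrieved context vs. the question? (Raw attention weights only; the paper's
# full analysis also uses value vectors, contributions, and knockouts.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in; the paper studies LLaMa and Phi models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

context = "Context: The Eiffel Tower is located in Paris."
question = " Question: Where is the Eiffel Tower located? Answer:"
ids = tok(context + question, return_tensors="pt")
ctx_len = len(tok(context)["input_ids"])  # number of context tokens

with torch.no_grad():
    out = model(**ids, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer; row -1 holds
# the attention distribution of the last token over all previous tokens.
last = torch.stack(out.attentions)[:, 0, :, -1, :]   # (layers, heads, seq)
to_context = last[..., :ctx_len].sum(-1).mean()
to_question = last[..., ctx_len:].sum(-1).mean()
print(f"attention mass on context: {to_context:.2f}, on question: {to_question:.2f}")
```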

https://www.lamini.ai/blog/lamini-memory-tuning

TLDR:

Lamini Memory Tuning is a new way to embed facts into LLMs that improves factual accuracy and reduces hallucinations to previously unachievable levels — for one Fortune 500 customer, Lamini Memory Tuning led to 95% accuracy compared to 50% with other approaches. Hallucinations were reduced from 50% to 5%.

Lamini Memory Tuning is a research breakthrough that overcomes a seeming paradox in the AI world: achieving precise factual accuracy (i.e. no hallucinations) while upholding the generalization capabilities that make LLMs valuable in the first place.

The method entails tuning millions of expert adapters (e.g. LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral 3. If the goal is to get Roman Empire facts exactly right, Lamini Memory Tuning would create experts on Caesar, aqueducts, legions, and any other facts you provide. Inspired by information retrieval, the model retrieves only the most relevant experts from an index at inference time — not all the model weights — so latency and cost are dramatically lower. High accuracy, high speed, low cost: with Lamini Memory Tuning, you don’t have to choose.
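A hedged sketch of the routing idea follows: key each expert adapter by a description of the facts it holds, embed the incoming prompt, and load only the nearest adapter(s) at inference time. The adapter names, embedder, and index here are illustrative, not Lamini's implementation.

```python
# Hedged sketch of "retrieve the relevant expert adapter" routing.
# Assumes sentence-transformers is installed; adapter names are hypothetical.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Each expert adapter (e.g. a LoRA) is keyed by a description of its facts.
expert_index = {
    "caesar-facts": "Julius Caesar: dates, offices, campaigns",
    "aqueduct-facts": "Roman aqueducts: construction, locations",
    "legion-facts": "Roman legions: numbering, composition",
}
keys = list(expert_index)
key_vecs = embedder.encode([expert_index[k] for k in keys], normalize_embeddings=True)

def route(prompt: str, top_k: int = 1):
    """Return the adapter name(s) most relevant to the prompt."""
    q = embedder.encode([prompt], normalize_embeddings=True)[0]
    scores = key_vecs @ q
    return [keys[i] for i in np.argsort(-scores)[:top_k]]

# At inference you would load only the selected adapter onto the base model,
# e.g. with peft: PeftModel.from_pretrained(base_model, path_for[route(p)[0]])
# (hypothetical paths). Only the routed expert's weights are touched.
print(route("When was the Aqua Claudia completed?"))  # -> ['aqueduct-facts']
```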


https://arxiv.org/pdf/2406.05085

Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, synthetic datasets, and real-world use cases to demonstrate MRAG’s effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores. Website & code: https://github.com/spcl/MRAG
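The sketch below illustrates the multi-aspect retrieval idea in a simplified form: instead of reading real attention-head activations, it slices a single sentence embedding into H sub-spaces, retrieves in each, and merges the per-space rankings with a weighted vote. This is an approximation of the MRAG idea, not the released code.

```python
# Simplified multi-aspect retrieval: one "head" sub-space per embedding slice,
# each sub-space votes for its own top documents. Not the authors' code.

import numpy as np
from sentence_transformers import SentenceTransformer

H = 8  # number of "heads" / aspect sub-spaces (illustrative)
model = SentenceTransformer("all-MiniLM-L6-v2")

def head_embeddings(texts):
    vecs = model.encode(texts, normalize_embeddings=True)
    return np.stack(np.array_split(vecs, H, axis=1))  # (H, n, d/H)

docs = ["doc about finance ...", "doc about travel ...", "doc about sports ..."]
doc_heads = head_embeddings(docs)

def mrag_retrieve(query, top_k=2):
    q_heads = head_embeddings([query])[:, 0, :]          # (H, d/H)
    votes = np.zeros(len(docs))
    for h in range(H):
        scores = doc_heads[h] @ q_heads[h]
        # each sub-space casts weighted votes for its own top_k documents
        for rank, idx in enumerate(np.argsort(-scores)[:top_k]):
            votes[idx] += top_k - rank
    return [docs[i] for i in np.argsort(-votes)[:top_k]]

print(mrag_retrieve("a query mixing finance and travel aspects"))
```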

https://github.com/infoslack/qdrant-example/blob/main/self-querying/self-querying.ipynb
https://www.datacamp.com/tutorial/knowledge-graph-rag
https://arxiv.org/pdf/2406.07348

Retrieval-Augmented Generation (RAG) has recently improved the performance of Large Language Models (LLMs) in knowledge-intensive tasks such as Question-Answering (QA). RAG expands the query context by incorporating external knowledge bases to enhance response accuracy. However, it is inefficient to access LLMs multiple times for each query, and unreliable to retrieve all the relevant documents with a single query. We have found that even when some critical documents have low relevance to the query, it is possible to retrieve the remaining documents by combining parts of the already-retrieved documents with the query. To mine this relevance, a two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and answer accuracy while maintaining efficiency. Additionally, a compact classifier is applied to two different selection strategies to determine the contribution of the retrieved documents to answering the query and to retain only the relevant ones. Meanwhile, DR-RAG calls the LLM only once per query, which significantly improves efficiency. Experimental results on multi-hop QA datasets show that DR-RAG significantly improves answer accuracy and achieves new progress in QA systems.
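A hedged sketch of the two-stage idea: stage one retrieves documents for the query alone; stage two reuses each retrieved document as extra context to surface documents whose relevance only appears in combination with it. A score threshold stands in for the paper's compact classifier; the retriever and threshold are illustrative, not the authors' implementation.

```python
# Two-stage, dynamic-relevant retrieval in the spirit of DR-RAG (simplified).

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query, doc_vecs, docs, k):
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(doc_vecs @ q))[:k]
    return [(docs[i], float(doc_vecs[i] @ q)) for i in top]

def dr_rag_context(query, docs, k1=3, k2=2, threshold=0.3):
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    stage1 = retrieve(query, doc_vecs, docs, k1)
    selected = {d for d, _ in stage1}
    # Stage 2: pair the query with each stage-1 document to pull in documents
    # whose relevance only shows up in combination with it (multi-hop evidence).
    for doc, _ in stage1:
        for cand, score in retrieve(query + " " + doc, doc_vecs, docs, k2):
            if cand not in selected and score >= threshold:  # stand-in classifier
                selected.add(cand)
    # The selected documents are concatenated into a single LLM call downstream.
    return list(selected)
```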

https://www.elastic.co/search-labs/blog/rag-playground-introduction

So briefly, what is reranking? Rerankers take the ‘top n’ search results from existing vector search and keyword search systems, and provide a semantic boost to those results. With good reranking in place, you have better ‘top n’ results without requiring you to change your model or your data indexes — ultimately providing better search results you can send to large language models (LLMs) as context.
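A minimal example of this pattern, using an open-source cross-encoder from sentence-transformers as the reranker (an illustration, not Elastic's hosted semantic reranker):

```python
# Rerank the top-n hits from an existing vector or keyword search with a
# cross-encoder, then pass the best few to the LLM as context.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, hits, top_k=5):
    """`hits` are the top-n documents returned by the first-stage retriever."""
    scores = reranker.predict([(query, h) for h in hits])
    ranked = sorted(zip(hits, scores), key=lambda x: -x[1])
    return [doc for doc, _ in ranked[:top_k]]

# The reranked top_k documents are what you send to the LLM as context.
```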

https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-hybrid

sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.