Advanced RAG techniques // for better accuracy

Retrieval-augmented generation (RAG) is a core component of modern AI pipelines.

sbagency
9 min read · Jun 23, 2024
https://www.elastic.co/search-labs/blog/elasticsearch-cohere-rerank

So briefly, what is reranking? Rerankers take the ‘top n’ search results from existing vector search and keyword search systems, and provide a semantic boost to those results. With good reranking in place, you have better ‘top n’ results without requiring you to change your model or your data indexes — ultimately providing better search results you can send to large language models (LLMs) as context.
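
As a rough illustration of the idea (not the Elastic/Cohere integration the post above describes), here is a minimal reranking pass in Python using an open cross-encoder from sentence-transformers; the model name and the placeholder candidate list are assumptions:

```python
# Minimal second-stage reranker: score each (query, candidate) pair jointly,
# then keep the best 'top n' to pass to the LLM as context.
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# 'first_stage_hits' stands in for results from your existing vector/keyword search.
first_stage_hits = [
    "Rerankers rescore candidate documents against the query.",
    "Elasticsearch supports hybrid keyword and vector retrieval.",
    "Cross-encoders read the query and document together.",
]
print(rerank("what does a reranker do?", first_stage_hits, top_n=2))
```

Because the cross-encoder reads the query and each document together, it can reorder the first-stage hits semantically without touching the underlying index.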

https://www.philschmid.de/fine-tune-llms-in-2024-with-trl
https://www.microsoft.com/en-us/research/publication/from-local-to-global-a-graph-rag-approach-to-query-focused-summarization/

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.
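
The official implementation is linked above; purely as an illustration of the two-stage index plus map-reduce query flow the abstract describes, a compressed sketch might look like this (llm() is a stand-in for any chat-completion call, the prompts are invented, and community detection is delegated to networkx):

```python
# Graph RAG in miniature: (1) an LLM extracts an entity graph, (2) communities of
# related entities are pre-summarized, (3) a global question is answered by
# mapping over community summaries and reducing the partial answers.
import networkx as nx

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def build_index(documents: list[str]) -> list[str]:
    graph = nx.Graph()
    for doc in documents:
        triples = llm(f"Extract entity triples, one 'a | relation | b' per line:\n{doc}")
        for line in triples.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:
                graph.add_edge(parts[0], parts[2], relation=parts[1])
    # Group closely related entities and pre-generate one summary per community.
    communities = nx.community.louvain_communities(graph)
    return [llm(f"Summarize what the corpus says about: {sorted(c)}") for c in communities]

def global_query(question: str, community_summaries: list[str]) -> str:
    partials = [llm(f"Using only this summary, partially answer '{question}':\n{s}")
                for s in community_summaries]
    return llm(f"Combine these partial answers into one response to '{question}':\n"
               + "\n\n".join(partials))
```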

https://arxiv.org/pdf/2406.12430

In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, d_best, for a decision-making question Q, business rules R and a database D. Since there is no benchmark that can examine Decision QA, we propose the Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.

This paper introduces a new task called Decision QA, which aims to use large language models (LLMs) for complex decision-making that requires data analysis. The key points are:

1. Decision QA task: Given a decision-making question, business rules, and a database, the goal is to determine the best decision.

2. DQA benchmark: The authors created a benchmark called DQA with two scenarios (Locating and Building) based on video games that simulate business situations. It contains 301 question-database pairs.

3. PlanRAG technique: They propose a new Retrieval-Augmented Generation (RAG) technique called PlanRAG, which extends iterative RAG by adding planning and re-planning steps.

4. Methodology: PlanRAG involves three main steps: planning, retrieving & answering, and re-planning. It aims to make more effective decisions by first creating a plan for data analysis (a rough sketch of this loop follows the summary below).

5. Experiments: PlanRAG outperformed existing RAG techniques, improving accuracy by 15.8% for the Locating scenario and 7.4% for the Building scenario compared to iterative RAG.

6. Analysis: The paper includes detailed analysis of the results, including performance on different question types, database types, and error categories.

7. Limitations: The authors acknowledge limitations such as focusing only on graph and relational databases, and not exploring low-level methods for solving Decision QA.

8. Ethical considerations: The paper discusses potential biases in the data and steps taken to mitigate them, as well as licensing considerations for the video game data used.

The research aims to advance the use of LLMs in complex decision-making scenarios that require data analysis and planning.
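
As a sketch of that plan / retrieve-and-answer / re-plan loop (not the released implementation; llm(), run_query(), and the prompts are placeholders):

```python
# PlanRAG-style loop (sketch): plan -> retrieve & answer -> re-plan if needed.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client")

def run_query(query: str) -> str:
    raise NotImplementedError("execute against your graph/relational database")

def plan_rag(question: str, rules: str, schema: str, max_rounds: int = 5) -> str:
    # Step 1: the LM drafts a data-analysis plan before touching the database.
    plan = llm(f"Question: {question}\nRules: {rules}\nSchema: {schema}\n"
               "Write a step-by-step data-analysis plan.")
    observations: list[str] = []
    for _ in range(max_rounds):
        step = llm(f"Plan:\n{plan}\nObservations:\n{observations}\n"
                   "Either output the next database query, ANSWER: <decision>, "
                   "or REPLAN if the plan no longer fits the data.")
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("REPLAN"):
            # Step 3: re-plan when retrieved data contradicts the current plan.
            plan = llm(f"Revise the plan given these observations:\n{observations}")
            continue
        # Step 2: retrieve by running the generated query and record the result.
        observations.append(run_query(step))
    return llm(f"Best decision given observations:\n{observations}")
```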

https://x.com/omarsar0/status/1803262374574448757
https://arxiv.org/pdf/2406.12824

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical applications of language models in search, question answering, and chatbots. However, the exact nature of how this approach works isn’t clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a “shortcut” and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis, to show that the parametric memory is minimally utilized when answering a question, and (ii) Attention Contributions and Knockouts, to show that the last token residual stream is not enriched from the subject token in the question, but is enriched from other informative tokens in the context. We find this pronounced “shortcut” behaviour holds across both the LLaMa and Phi families of models.
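
The paper’s evidence comes from causal mediation analysis and attention knockouts on model internals; as a much cruder probe of the same question (context vs. parametric memory), one can simply compare the log-probability a model assigns to the answer with and without the retrieved passage in the prompt. A sketch with Hugging Face transformers; the model name, question, and passage are arbitrary placeholders, and this is not the paper’s method:

```python
# Crude probe: how much does the retrieved context, versus parametric memory,
# drive the probability the model assigns to the gold answer?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def answer_logprob(prompt: str, answer: str) -> float:
    # Sum of log-probabilities of the answer tokens, conditioned on the prompt.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice the positions preceding the answer.
    logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return logprobs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

question = "Q: In which year was the Eiffel Tower completed?\nA:"
context = "The Eiffel Tower was completed in 1889.\n"
with_ctx = answer_logprob(context + question, " 1889")
without_ctx = answer_logprob(question, " 1889")
print(f"log p(answer | context+question) = {with_ctx:.2f}")
print(f"log p(answer | question only)    = {without_ctx:.2f}")
```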

https://www.lamini.ai/blog/lamini-memory-tuning

TLDR:

Lamini Memory Tuning is a new way to embed facts into LLMs that improves factual accuracy and reduces hallucinations to previously unachievable levels — for one Fortune 500 customer, Lamini Memory Tuning led to 95% accuracy compared to 50% with other approaches. Hallucinations were reduced from 50% to 5%.

Lamini Memory Tuning is a research breakthrough that overcomes a seeming paradox in the AI world: achieving precise factual accuracy (i.e. no hallucinations) while upholding the generalization capabilities that make LLMs valuable in the first place.

The method entails tuning millions of expert adapters (e.g. LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral 3. If the goal is to get Roman Empire facts exactly right, Lamini Memory Tuning would create experts on Caesar, aqueducts, legions, and any other facts you provide. Inspired by information retrieval, the model retrieves only the most relevant experts from an index at inference time — not all the model weights — so latency and cost are dramatically lower. High accuracy, high speed, low cost: with Lamini Memory Tuning, you don’t have to choose.
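
Lamini has not published the implementation, but the description above (an index over many narrow LoRA “experts”, of which only the most relevant are retrieved at inference) can be caricatured roughly as follows; embed(), the adapter paths, and the routing-by-description scheme are assumptions, not Lamini’s API:

```python
# Rough sketch of the "retrieve experts, not just documents" idea described above.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in any sentence-embedding model")

# One LoRA "expert" per narrow slice of facts, keyed by a short description.
EXPERTS = {
    "adapters/caesar":    "facts about Julius Caesar",
    "adapters/aqueducts": "facts about Roman aqueducts",
    "adapters/legions":   "facts about Roman legions",
}

def route_experts(query: str, k: int = 2) -> list[str]:
    # Retrieve only the most relevant adapters for this query, by cosine similarity.
    q = embed(query)
    def sim(desc: str) -> float:
        v = embed(desc)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(EXPERTS, key=lambda path: sim(EXPERTS[path]), reverse=True)[:k]

# At inference you would load only route_experts(query) on top of the base model
# (e.g. with a LoRA library), rather than activating every adapter's weights.
```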


https://arxiv.org/pdf/2406.05085

Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, synthetic datasets, and real-world use cases to demonstrate MRAG’s effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores. Website & code: https://github.com/spcl/MRAG
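
The repository above contains the real implementation; the core idea (one embedding space per attention head, with per-head rankings merged into a single document list) can be sketched like this, where head_embeddings() is a placeholder for however you extract per-head activations from your embedding model:

```python
# Multi-aspect retrieval in the spirit of MRAG (sketch, not the paper's code):
# rank documents separately in each attention head's space, then merge by voting.
from collections import Counter
import numpy as np

def head_embeddings(text: str) -> np.ndarray:
    raise NotImplementedError("return per-head activations, shape (num_heads, dim)")

def mrag_retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    q = head_embeddings(query)                            # (H, d)
    d = np.stack([head_embeddings(doc) for doc in docs])  # (N, H, d)
    votes: Counter[int] = Counter()
    num_heads = q.shape[0]
    for h in range(num_heads):
        # Rank documents in head h's embedding space ...
        sims = d[:, h, :] @ q[h] / (
            np.linalg.norm(d[:, h, :], axis=1) * np.linalg.norm(q[h]) + 1e-9)
        # ... and let each head vote for its top documents, weighting higher ranks more.
        for rank, idx in enumerate(np.argsort(-sims)[:k]):
            votes[int(idx)] += k - rank
    return [docs[i] for i, _ in votes.most_common(k)]
```

Because different heads capture different aspects of the data, a multi-aspect query can surface documents that sit far apart in a single shared embedding space.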

https://github.com/infoslack/qdrant-example/blob/main/self-querying/self-querying.ipynb
https://www.datacamp.com/tutorial/knowledge-graph-rag
https://arxiv.org/pdf/2406.07348

Retrieval-Augmented Generation (RAG) has recently been shown to improve the performance of Large Language Models (LLMs) on knowledge-intensive tasks such as Question-Answering (QA). RAG expands the query context by incorporating external knowledge bases to enhance response accuracy. However, it would be inefficient to access LLMs multiple times for each query, and it is unreliable to retrieve all the relevant documents with a single query. We have found that even when there is low relevance between some critical documents and the query, it is possible to retrieve the remaining documents by combining parts of the already-retrieved documents with the query. To mine this relevance, a two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers while maintaining efficiency. Additionally, a compact classifier is applied to two different selection strategies to determine the contribution of the retrieved documents to answering the query and to retrieve the relatively relevant documents. Meanwhile, DR-RAG calls the LLM only once, which significantly improves efficiency. Experimental results on multi-hop QA datasets show that DR-RAG can significantly improve the accuracy of answers and achieve new progress in QA systems.
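
As a sketch of the two-stage idea (query-only retrieval, then query-plus-document retrieval, then a compact classifier to filter, with a single LLM call at the end); retrieve(), is_relevant(), and llm() are placeholders rather than the authors’ components:

```python
# DR-RAG-style two-stage retrieval (sketch).
def retrieve(text: str, k: int = 5) -> list[str]:
    raise NotImplementedError("your dense retriever")

def is_relevant(query: str, doc: str, context: list[str]) -> bool:
    raise NotImplementedError("a small classifier scoring the doc's contribution")

def llm(prompt: str) -> str:
    raise NotImplementedError("single LLM call")

def dr_rag(query: str) -> str:
    first_stage = retrieve(query)
    candidates = list(first_stage)
    for doc in first_stage:
        # Stage two: concatenating a retrieved document with the query can surface
        # documents whose relevance to the query alone is too low to retrieve directly.
        candidates += retrieve(query + "\n" + doc)
    # Keep only documents the classifier judges to actually help answer the query.
    selected = [d for d in dict.fromkeys(candidates) if is_relevant(query, d, first_stage)]
    # One LLM call over the final document set.
    return llm("Answer using the context below.\n" + "\n\n".join(selected) + f"\n\nQ: {query}")
```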

https://www.elastic.co/search-labs/blog/rag-playground-introduction


https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-hybrid
https://www.zdnet.com/article/amazon-proposes-a-new-ai-benchmark-to-measure-rag/
https://arxiv.org/pdf/2405.13622

We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG). Evaluation is performed by scoring the RAG on an automatically generated synthetic exam composed of multiple choice questions based on the corpus of documents associated with the task. Our method is an automated, cost-efficient, interpretable, and robust strategy to select the optimal components for a RAG system. We leverage Item Response Theory (IRT) to estimate the quality of an exam and its informativeness on task-specific accuracy. IRT also provides a natural way to iteratively improve the exam by eliminating the exam questions that are not sufficiently informative about a model’s ability. We demonstrate our approach on four new open-ended Question-Answering tasks based on Arxiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In addition, our experiments reveal more general insights into factors impacting RAG performance like size, retrieval mechanism, prompting and fine-tuning. Most notably, our findings show that choosing the right retrieval algorithms often leads to bigger performance gains than simply using a larger language model.
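
A stripped-down version of that evaluation loop (generate a multiple-choice exam from the task’s own documents, then grade each candidate RAG configuration on it) might look like the sketch below; llm() is a placeholder and the IRT-based pruning of uninformative questions is omitted:

```python
# Sketch of exam-based RAG evaluation: build a synthetic multiple-choice exam
# from the task corpus, then score each candidate pipeline on it.
import json

def llm(prompt: str) -> str:
    raise NotImplementedError("question-generation model")

def make_exam(documents: list[str], per_doc: int = 2) -> list[dict]:
    exam = []
    for doc in documents:
        raw = llm(f"Write {per_doc} multiple-choice questions about this passage as a JSON "
                  'list of {"question", "choices", "answer"} objects:\n' + doc)
        exam.extend(json.loads(raw))
    return exam

def score_rag(rag_answer, exam: list[dict]) -> float:
    # rag_answer(question, choices) -> the pipeline's chosen option
    correct = sum(rag_answer(q["question"], q["choices"]) == q["answer"] for q in exam)
    return correct / len(exam)

# Compare e.g. different retrievers, prompts, or model sizes on the same exam:
# for name, pipeline in candidates.items():
#     print(name, score_rag(pipeline, exam))
```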

sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.