Advanced RAG techniques // again
Many projects now go as far as training specialized LLMs primarily for RAG and structured output tasks.
Why Another LLM?
With dozens of different LLMs available, both through APIs and as commercial weights, why would Vectara be interested in having its own? That is an excellent question, and there are a few reasons we wanted our own LLM. You can read more details in our product-oriented blog post, but to summarize: Mockingbird is focused on, and delivers high quality for (as we will see below), the tasks Vectara’s customers care about, and it can run in Vectara’s customers’ VPCs or on-premises, so their critical data never leaves their environment.
Training Mockingbird
Mockingbird is trained primarily for RAG and structured output tasks.
Training dataset quality is a key success factor
In order to train an LLM that is good at this task and can handle all the scenarios mentioned above, one of the most important steps is building training datasets that contain all of these complexities in the input and pair them with high-quality output summaries that include citations. More than half of the training effort for Mockingbird went towards producing, creating, and curating such RAG datasets across different domains and languages. Note that we do not train on customer data, so your data remains secure and private, and Mockingbird never sees it during training.
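The post does not show the dataset format itself; as an illustration only, a single RAG fine-tuning example with inline citations might be structured like the sketch below. The field names, query, and passages are hypothetical assumptions, not Vectara's actual schema.

```python
# Hypothetical sketch of a RAG fine-tuning example with inline citations.
# Field names and content are illustrative assumptions, not Vectara's actual format.
rag_training_example = {
    "query": "What is the warranty period for the X200 model?",
    "retrieved_passages": [
        {"id": 1, "text": "The X200 ships with a two-year limited warranty."},
        {"id": 2, "text": "Warranty claims must be filed through the support portal."},
        {"id": 3, "text": "The X100 was discontinued in 2021."},  # distractor passage
    ],
    # Target output: a grounded summary with numbered citations back to the passages.
    "target_response": (
        "The X200 comes with a two-year limited warranty [1]; "
        "claims are filed through the support portal [2]."
    ),
}
```

Including distractor passages and citation markers in the targets is what teaches the model to ground its answers and ignore irrelevant context.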
In response to growing enterprise concerns over data security and the quality of retrieval-augmented generation (RAG), Vectara is proud to introduce Mockingbird, an LLM fine-tuned specifically for RAG. Mockingbird achieves the world’s leading RAG output quality and hallucination mitigation, making it perfect for enterprise RAG and autonomous agent use cases.
Speculative RAG: a generalist LM verifies multiple RAG drafts produced by a specialist LM
Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through iterative LLM refinement or self-critique capabilities acquired through additional instruction tuning of LLMs. In this work, we introduce SPECULATIVE RAG — a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Each draft is generated from a distinct subset of retrieved documents, offering diverse perspectives on the evidence while reducing input token counts per draft. This approach enhances comprehension of each subset and mitigates potential position bias over long context. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. Extensive experiments demonstrate that SPECULATIVE RAG achieves state-of-the-art performance with reduced latency on TriviaQA, MuSiQue, PubHealth, and ARC-Challenge benchmarks. It notably enhances accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems on PubHealth.
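The abstract describes the overall flow but not the prompts or scoring details. A minimal sketch of that control flow, assuming placeholder `draft_lm` and `verifier_lm` callables (their interfaces are my assumption, not the paper's), might look like this:

```python
from concurrent.futures import ThreadPoolExecutor

def speculative_rag(question, documents, draft_lm, verifier_lm, num_subsets=3):
    """Sketch of the Speculative RAG control flow, not the paper's implementation.

    Assumed interfaces:
      draft_lm(question, docs) -> draft answer string   (smaller specialist LM)
      verifier_lm(question, draft) -> confidence score  (larger generalist LM)
    """
    # Split the retrieved documents into distinct subsets so each draft
    # sees a different slice of the evidence and a shorter context.
    subsets = [documents[i::num_subsets] for i in range(num_subsets)]

    # The specialist LM produces one draft per subset, in parallel.
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda docs: draft_lm(question, docs), subsets))

    # The generalist LM performs a single verification pass over the drafts;
    # the highest-scoring draft becomes the final answer.
    scores = [verifier_lm(question, d) for d in drafts]
    return max(zip(drafts, scores), key=lambda pair: pair[1])[0]
```

Delegating drafting to the small model and reserving the large model for one verification pass is what yields the latency savings the authors report.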
Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs also exhibit the “distraction phenomenon,” where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for finetuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a 93× reduction in compute time while improving accuracy by 43% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
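The actual method operates inside the transformer's attention and caching machinery; the rough sketch below only mimics the high-level "parallel paths, prune the irrelevant ones" idea at the prompt level, with assumed helper callables rather than the paper's mechanism:

```python
def superposition_style_answer(query, documents, score_path, generate, keep_top_k=2):
    """Very rough sketch of the path-pruning idea behind superposition prompting.

    Assumed helper callables (illustrative, not from the paper):
      score_path(query, doc) -> relevance score for a single-document path
      generate(query, docs)  -> final answer conditioned on the surviving docs
    """
    # Each document forms its own independent "path": (document, relevance score).
    paths = [(doc, score_path(query, doc)) for doc in documents]

    # Discard paths judged irrelevant, keeping only the strongest ones.
    survivors = sorted(paths, key=lambda p: p[1], reverse=True)[:keep_top_k]

    # Answer using only the documents from the surviving paths.
    return generate(query, [doc for doc, _ in survivors])
```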
Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.
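To make the bottom-up tree construction concrete, here is a sketch under stated assumptions: the `embed`, `cluster`, and `summarize` helpers are placeholders I introduce for illustration (the paper uses its own embedding model, GMM clustering, and an LLM summarizer), not the authors' code.

```python
def build_raptor_tree(chunks, embed, cluster, summarize, max_levels=3):
    """Sketch of RAPTOR-style tree construction (illustrative only).

    Assumed helper callables:
      embed(texts)      -> list of embedding vectors, one per text
      cluster(vectors)  -> list of index groups (e.g. from a GMM or k-means)
      summarize(texts)  -> one summary string for a group of texts
    """
    tree_levels = [chunks]  # level 0: the raw text chunks
    current = chunks
    for _ in range(max_levels):
        if len(current) <= 1:
            break
        vectors = embed(current)
        groups = cluster(vectors)
        # Each cluster of nodes is summarized into a single parent node
        # at the next level up the tree.
        current = [summarize([current[i] for i in group]) for group in groups]
        tree_levels.append(current)
    # At inference time, retrieval can draw from any level, mixing
    # fine-grained chunks with higher-level summaries.
    return tree_levels
```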