Best embedding models // vector search

bi-encoder and cross-encoder // top-k and re-ranking

sbagency
5 min read · Jul 5, 2024
https://softwaremill.com/embedding-models-comparison/

Retriever architecture

The retriever architecture consists of a bi-encoder model and, optionally, a cross-encoder. The bi-encoder encodes the query as a feature vector and compares that vector's distance to the other documents' vectors stored in the vector database.
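A minimal sketch of that first stage, assuming sentence-transformers and an illustrative model choice (any bi-encoder from the MTEB leaderboard would slot in the same way):

```python
# Hedged sketch: first-stage retrieval with a bi-encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "The cross-encoder re-ranks candidate documents.",
    "Vector databases store document embeddings for fast search.",
    "Bi-encoders map queries and documents into the same vector space.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("how does a bi-encoder work?", normalize_embeddings=True)

# Cosine similarity between the query vector and every document vector;
# a real system would delegate this to a vector database (ANN index).
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:2]
for idx in top_k:
    print(f"{scores[idx].item():.3f}  {documents[idx]}")
```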

A cross-encoder is used for re-ranking. It takes both the query and the document text as input and outputs a document similarity score. It is much slower, but more accurate at the same time. Therefore the bi-encoder is used for first-stage (top-k) document selection, and the cross-encoder for fine-grained document selection. Combining the power of those two, we get both high accuracy and speed.
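And a sketch of the second stage, using an illustrative MS MARCO cross-encoder (the BAAI/bge-reranker-large model mentioned later in this post is a stronger drop-in alternative):

```python
# Hedged sketch: second-stage re-ranking with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does a bi-encoder work?"
candidates = [  # e.g. the top-k output of the bi-encoder stage
    "Bi-encoders map queries and documents into the same vector space.",
    "Vector databases store document embeddings for fast search.",
]

# The cross-encoder scores each (query, document) pair jointly,
# which is slower but more accurate than comparing precomputed vectors.
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```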

https://huggingface.co/spaces/mteb/leaderboard

The re-ranking module significantly improved the custom embedding model and only slightly changed the performance of the GPT model; the custom model almost reached the GPT model's performance. The weaker effect of re-ranking on the GPT embedding model seems to indicate that the cross-encoder has reached its performance limit, and that further improvement could be achieved only by fine-tuning the cross-encoder on a custom dataset or by using a better cross-encoder model. Taking into account that BAAI/bge-reranker-large is considered one of the best in its class, fine-tuning seems like the most promising option.

https://www.reasonfieldlab.com/post/how-to-improve-document-matching-when-designing-the-chatbot
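Since fine-tuning the re-ranker is identified above as the most promising option, here is a hedged sketch of what that could look like with the sentence-transformers CrossEncoder training API; the training pairs, labels, and hyperparameters are illustrative, not from the source:

```python
# Hedged sketch: fine-tuning a cross-encoder re-ranker on a custom dataset,
# assuming labeled (query, document, relevance) pairs are available.
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# num_labels=1 makes the head output a single relevance score.
model = CrossEncoder("BAAI/bge-reranker-large", num_labels=1)

train_samples = [  # illustrative labels: 1.0 = relevant, 0.0 = irrelevant
    InputExample(texts=["what is a vector store?",
                        "A vector database indexes embeddings."], label=1.0),
    InputExample(texts=["what is a vector store?",
                        "LLMs are trained on large text corpora."], label=0.0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=10)
model.save("bge-reranker-large-finetuned")
```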

Summary

The retrieval module is the most important component of a RAG chatbot. Good-quality document matches allow the instruction-tuned LLM to produce meaningful responses. Building a robust and accurate retrieval engine consists of designing a prompt, choosing and deploying the embedding model together with the vector store, and finally adding the re-ranking module. Each step should be evaluated, and the best-performing setup should be chosen.
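Evaluating each step can be as simple as measuring hit rate @ k on a small labeled set; a minimal sketch, assuming a retrieve(query, k) function that wraps whichever setup is being tested:

```python
# Hedged sketch: compare retrieval setups (e.g. bi-encoder only vs.
# bi-encoder + re-ranker) on the same labeled queries, keep the best.
def hit_rate_at_k(retrieve, labeled_queries, k=5):
    """retrieve(query, k) -> list of doc ids;
    labeled_queries: list of (query, relevant_doc_id) pairs."""
    hits = sum(
        relevant_id in retrieve(query, k)
        for query, relevant_id in labeled_queries
    )
    return hits / len(labeled_queries)
```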

https://python.langchain.com/v0.2/docs/integrations/retrievers/flashrank-reranker/
https://github.com/PrithivirajDamodaran/FlashRank
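Following the usage shown in the FlashRank repository linked above, re-ranking there takes only a few lines (the passages below are illustrative):

```python
# Hedged sketch of FlashRank, a lightweight CPU-friendly re-ranker.
from flashrank import Ranker, RerankRequest

ranker = Ranker()  # default: a small ONNX cross-encoder

passages = [
    {"id": 1, "text": "Bi-encoders map queries and documents into one vector space.", "meta": {}},
    {"id": 2, "text": "Vector databases store document embeddings for fast search.", "meta": {}},
]
request = RerankRequest(query="how does a bi-encoder work?", passages=passages)
results = ranker.rerank(request)  # passages come back with a relevance score
for passage in results:
    print(passage["score"], passage["text"])
```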
https://developer.nvidia.com/blog/nvidia-text-embedding-model-tops-mteb-leaderboard/
https://arxiv.org/pdf/2406.01607v2

Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted with the rise of Large Language Models (LLMs) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, recent advancements in training data quantity, quality and diversity; synthetic data generation from LLMs as well as using LLMs as backbones encourage great improvements in pursuing universal text embeddings. In this paper, we provide an overview of the recent advances in universal text embedding models with a focus on the top performing text embeddings on Massive Text Embedding Benchmark (MTEB). Through detailed comparison and analysis, we offer a systematic organization of the literature, underscoring the significant developments and limitations in the recent advancements of universal text embedding models, and suggest potential future research directions that could inspire further advancements in this field.

https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/

We evaluate voyage-multilingual-2 on over 85 datasets that we collected from various sources, covering 27 languages, including English, French, German, Japanese, Spanish, Korean, Bengali, Portuguese, Russian, etc. Each of the first 6 languages has multiple datasets. The other languages involve one dataset each and are grouped into an OTHER category.
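A hedged sketch of calling that model through the voyageai Python client (assumes an API key in VOYAGE_API_KEY; the sample texts are illustrative):

```python
# Hedged sketch: multilingual embeddings with voyage-multilingual-2.
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

result = vo.embed(
    ["Guten Morgen", "Bonjour", "おはようございます"],
    model="voyage-multilingual-2",
    input_type="document",  # use "query" when embedding search queries
)
print(len(result.embeddings), len(result.embeddings[0]))
```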

https://training.continuumlabs.ai/knowledge/vector-databases/improving-text-embeddings-with-large-language-models
https://arxiv.org/pdf/2401.00368

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pretraining with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
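The "standard contrastive loss" referred to here is typically the in-batch InfoNCE objective: each query is scored against every passage in the batch, with the diagonal as the positive pair. A minimal PyTorch sketch, with an illustrative temperature:

```python
# Hedged sketch of in-batch contrastive (InfoNCE) loss for embedding training.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, passage_emb, temperature=0.02):
    # query_emb, passage_emb: (batch, dim); row i of each is a positive pair
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)   # positives lie on the diagonal
```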

https://medium.com/@leo_dev/bleeding-edge-open-source-multilingual-models-9e7553af9da1
https://huggingface.co/intfloat/multilingual-e5-large
https://arxiv.org/pdf/2407.12580

Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
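The text-only intfloat/multilingual-e5-large linked above is the easiest member of this family to try; per its model card, inputs need "query: " / "passage: " prefixes. A minimal sketch:

```python
# Hedged sketch: multilingual retrieval with multilingual-e5-large.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

queries = ["query: how are embeddings compared?"]
passages = [
    "passage: Embeddings are compared by cosine similarity.",
    "passage: Les embeddings sont comparés par similarité cosinus.",
]
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)  # cross-lingual similarity scores
```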
