The evolution of embeddings
TF-IDF 1972 K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1972) 11–21.
TF-IDF 1973 K.S. Jones, Index term weighting, Inf. Storage Retr. 9 (11) (1973) 619–633.
Bag of Words 1981 Z.S. Harris, Distributional structure, in: Papers on Syntax, Springer, 1981, pp. 3–22.
BoN-Grams 1994 W.B. Cavnar, J.M. Trenkle, N-gram-based text categorization, in: Proceedings of 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94), 1994, pp. 161–175.
doc2vec 2014 Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on International Conference on Machine Learning (ICML) — Volume 32, ICML ’14, JMLR.org, 2014, pp. II–1188–II–1196.
DAN 2015 M. Iyyer, V. Manjunatha, J. Boyd-Graber, H. Daumé III, Deep unordered composition rivals syntactic methods for text classification, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1681–1691.
RCNN 2015 S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI ’15, AAAI Press, 2015, pp. 2267–2273.
RNNs 2015 D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: International Conference on Learning Representations (ICLR) 2015, 2015.
Skip-Thought 2015 R. Kiros, Y. Zhu, R.R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, S. Fidler, Skip-thought vectors, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 28, Curran Associates, Inc., 2015, pp. 3294–3302.
DESM 2016 E. Nalisnick, B. Mitra, N. Craswell, R. Caruana, Improving document ranking with dual word embeddings, in: Proceedings of the 25th International Conference Companion on World Wide Web, 2016, pp. 83–84.
DV-ngram 2016 B. Li, T. Liu, X. Du, D. Zhang, Z. Zhao, Learning document embeddings by predicting n-grams for sentiment classification of long movie reviews, in: Workshop Contribution at International Conference on Learning Representations (ICLR) 2016, 2016.
FastSent 2016 F. Hill, K. Cho, A. Korhonen, Learning distributed representations of sentences from unlabelled data, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (ACL), San Diego, California, 2016, pp. 1367–1377.
HAN 2016 Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489.
NVDM 2016 Y. Miao, L. Yu, P. Blunsom, Neural variational inference for text processing, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML) — Volume 48, ICML ’16, JMLR.org, 2016, pp. 1727–1736.
Siamese CBoW 2016 T. Kenter, A. Borisov, M. de Rijke, Siamese CBOW: Optimizing word embeddings for sentence representations, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics (ACL), Berlin, Germany, 2016, pp. 941–951.
CNN-LSTM 2017 Z. Gan, Y. Pu, R. Henao, C. Li, X. He, L. Carin, Learning generic sentence representations using convolutional neural networks, in: Empirical Methods in Natural Language Processing, EMNLP, 2017, pp. 2390–2400.
CNNs 2017 Y. Zhang, D. Shen, G. Wang, Z. Gan, R. Henao, L. Carin, Deconvolutional paragraph representation learning, in: I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., 2017, pp. 5438–5445.
CNNs 2017 Z. Zhu, J. Hu, Context aware document embedding, 2017, arXiv:1707.01521.
Doc2VecC 2017 M. Chen, Efficient vector representation for documents through corruption, in: International Conference on Learning Representations, ICLR, 2017.
DiSAN 2018 T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, C. Zhang, DiSAN: Directional self-attention network for RNN/CNN-free language understanding, in: AAAI, 2018, pp. 5446–5455.
ELMo 2018 M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics (ACL), New Orleans, Louisiana, 2018, pp. 2227–2237.
GPT-2 2019 A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog (2019).
ReSan 2018 T. Shen, T. Zhou, G. Long, J. Jiang, S. Wang, C. Zhang, Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI ’18, AAAI Press, 2018, pp. 4345–4352.
Sent2vec 2018 M. Pagliardini, P. Gupta, M. Jaggi, Unsupervised learning of sentence embeddings using compositional n-gram features, in: Proceedings of North American Chapter of the Association for Computational Linguistics NAACL-HLT, 2018, pp. 528–540.
BART 2019 M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019, arXiv:1910.13461.
BERT 2019 J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics (ACL), Minneapolis, Minnesota, 2019, pp. 4171–4186.
DistilBERT 2019 V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing at NeurIPS 2019, 2019.
DocBERT 2019 A. Adhikari, A. Ram, R. Tang, J. Lin, DocBERT: BERT for document classification, 2019, arXiv:1904.08398.
LASER 2019 M. Artetxe, H. Schwenk, Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond, Trans. Assoc. Comput. Linguist. 7 (2019) 597–610.
MASS 2019 K. Song, X. Tan, T. Qin, J. Lu, T. Liu, MASS: Masked sequence to sequence pre-training for language generation, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, in: Proceedings of Machine Learning Research, vol. 97, PMLR, 2019, pp. 5926–5936.
SBERT 2019 N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992.
Transformer-XL 2019 Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2978–2988.
VLAWE 2019 R.T. Ionescu, A. Butnaru, Vector of locally-aggregated word embeddings (VLAWE): A novel document-level representation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics (ACL), Minneapolis, Minnesota, 2019, pp. 363–369.
XLM 2019 A. Conneau, G. Lample, Cross-lingual language model pretraining, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019, pp. 7059–7069.
XLNet 2019 Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019, pp. 5753–5763.
ALBERT 2020 Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, in: International Conference on Learning Representations, ICLR, OpenReview.net, 2020.
ELECTRA 2020 K. Clark, M.-T. Luong, Q.V. Le, C.D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, 2020, arXiv:2003.10555.
P-SIF 2020 V. Gupta, A. Saw, P. Nokhiz, P. Netrapalli, P. Rai, P. Talukdar, P-SIF: Document embeddings using partition averaging, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 7863–7870.
P-SIF 2020 V. Gupta, A. Kumar, P. Nokhiz, H. Gupta, P. Talukdar, Improving docu- ment classification with multi-sense embeddings, in: European Conference on Artificial Intelligence (ECAI) 2020, IOS Press, 2020, pp. 2030–2037.
RoBERTa 2020 Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, in: Under Review as a Conference Paper at International Conference on Learning Representations (ICLR) 2020, 2020.
SpanBERT 2020 M. Joshi, D. Chen, Y. Liu, D. Weld, L. Zettlemoyer, O. Levy, SpanBERT: Improving pre-training by representing and predicting spans, Trans. Assoc. Comput. Linguist. 8 (2020).
SimCSE 2021 T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6894–6910.
AugCSE 2022 Z. Tang, M.Y. Kocyigit, D. Wijaya, AugCSE: Contrastive sentence embedding with diverse augmentations, 2022, arXiv:2210.13749.
DiffCSE 2022 Y.-S. Chuang, R. Dangovski, H. Luo, Y. Zhang, S. Chang, M. Soljačić, S.-W. Li, W.-t. Yih, Y. Kim, J. Glass, DiffCSE: Difference-based contrastive learning for sentence embeddings, 2022, arXiv:2204.10298.
SGPT 2022 N. Muennighoff, SGPT: GPT sentence embeddings for semantic search, 2022, arXiv:2202.08904.
BGE 2023 S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-Pack: Packaged resources to advance general Chinese embedding, 2023, arXiv:2309.07597.
Jina Embeddings v2 2023 M. Günther, et al., Jina Embeddings 2: 8192-token general-purpose text embeddings for long documents, 2023, arXiv:2310.19923.
From the Jina Embeddings v2 abstract: Text embedding models have emerged as powerful tools for transforming sentences into fixed-size feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings v2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings v2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI’s proprietary text-embedding-ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.
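To make the memory and latency trade-off described above concrete, here is a minimal sketch of the two routes (chunk-and-embed versus a single long-context embedding). It assumes the sentence-transformers library, naive fixed-size word chunking, and the public checkpoint names shown in the code; these specifics are illustrative assumptions, not details taken from this document.

```python
# Sketch of chunking vs. long-context embedding.
# Assumptions: sentence-transformers is installed and the two Hugging Face
# checkpoints named below are available; neither is specified by this document.
from sentence_transformers import SentenceTransformer

document = "A very long report about retrieval systems. " * 800  # stand-in long document

# Route 1: split into ~200-word chunks and embed each chunk separately.
words = document.split()
chunks = [" ".join(words[i:i + 200]) for i in range(0, len(words), 200)]
short_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 512-token encoder
chunk_vectors = short_model.encode(chunks)            # shape: (num_chunks, dim)

# Route 2: embed the whole document as one vector with a long-context model
# (assumed checkpoint; recent library versions accept trust_remote_code).
long_model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
doc_vector = long_model.encode([document])            # shape: (1, dim)

# The chunked route stores and searches many vectors per document; the
# long-context route keeps a single vector at the cost of a larger context window.
print(f"{len(chunks)} chunk vectors vs. 1 document vector")
```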
From the E5 abstract: This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of texts, such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40× more parameters.
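As a rough illustration of using such a single-vector embedder for zero-shot retrieval, the sketch below assumes the sentence-transformers library, a public E5 checkpoint name, and the "query:" / "passage:" prefixes that the E5 authors recommend for asymmetric search; these specifics are assumptions for illustration rather than details stated in this document.

```python
# Zero-shot retrieval sketch with an assumed E5 checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")  # assumed checkpoint name

# E5 expects role prefixes on the input text (per the model card).
passages = [
    "passage: BM25 is a classic lexical ranking function used in search engines.",
    "passage: Dense retrievers encode queries and documents into a shared vector space.",
]
query = "query: how do dense retrieval models represent documents?"

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, passage_emb)  # cosine similarities, shape (1, len(passages))
best = int(scores.argmax())
print(passages[best], float(scores[0, best]))
```

Because the embeddings are L2-normalized, the cosine similarity ranking doubles as a dot-product ranking, which is what most vector indexes compute.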
Embeddings in a RAG