Large Concept Model // LCM
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive with the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
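To make the encode/decode workflow concrete, here is a rough usage sketch of the text pipelines published in the SONAR repository. The class names, checkpoint names, and language codes (TextToEmbeddingModelPipeline, text_sonar_basic_encoder, eng_Latn, and so on) follow the repository's README at the time of writing and should be treated as assumptions rather than a stable API.

```python
# Rough sketch of encoding and decoding with the SONAR text pipelines.
# Class names, checkpoint names, and language codes are assumptions taken from
# the repository README and may differ in the current release.
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

# Encode a batch of sentences into fixed-size SONAR embeddings (one vector per sentence).
encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = encoder.predict(
    ["The weather is nice today.", "Sentence embeddings are fixed-size vectors."],
    source_lang="eng_Latn",
)

# Decode the same embeddings into another language: text-to-text translation
# through the fixed-size bottleneck described above.
decoder = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder",
    tokenizer="text_sonar_basic_encoder",
)
translations = decoder.predict(embeddings, target_lang="fra_Latn", max_seq_len=128)
print(translations)
```

Because every language shares the same embedding space, swapping target_lang is all that is needed to decode into a different language, and speech encoders can feed the same decoder for zero-shot speech-to-text translation.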
To conclude, we introduced a new multilingual and multimodal sentence embedding space called SONAR. We conducted an extensive study on objective functions to build our multilingual teacher sentence embedding space for text, and an extensive evaluation of our SONAR framework for both similarity search and decoding tasks. We extended this new text sentence embedding space to the speech modality to introduce Sentence-level multimOdal and laNguage-Agnostic Representations (SONAR). The SONAR text and speech encoders as well as the text decoders are freely available at https://github.com/facebookresearch/SONAR.
Meta has introduced the Large Concept Model (LCM), which aims to decouple reasoning from any single language, with the goal of processing many languages more efficiently and cost-effectively, for example across social media platforms. Rather than operating on language-specific tokens, the LCM works at a higher level of semantic representation: each sentence is mapped to a "concept", a fixed-size vector that captures the content of the message independently of the language it was written in. This lets the model learn from multiple languages and modalities and aggregate that knowledge into a single shared embedding space.
The LCM combines a diffusion-based generation process with a Transformer backbone to refine sentence embeddings and produce coherent outputs. Generation starts from a noisy embedding and refines it over a series of denoising iterations, which lets the model represent uncertainty about the next sentence instead of committing to a single point estimate. The Transformer conditions this refinement on the embeddings of the preceding sentences, capturing the nuanced meaning of the context, which is what makes the approach a candidate for tasks such as reasoning and multimodal integration. The LCM also has limitations: it depends on robust automatic text segmentation, and performance can degrade on very long or complex sentences whose meaning does not compress well into a single fixed-size embedding. Despite these limitations, the LCM is a promising approach to natural language processing, and further development and research are needed to fully explore its potential.
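As a purely illustrative sketch (not the released LCM), the following shows how a diffusion-style denoiser over sentence embeddings might work: a Transformer refines a noisy candidate for the next-sentence embedding, conditioned on the embeddings of the preceding sentences. The embedding dimension, step count, conditioning scheme, and the simplified update rule are all assumptions made for brevity.

```python
# Illustrative sketch of diffusion-style refinement over sentence embeddings.
# Dimensions, schedules, and the conditioning scheme are assumptions, not the released LCM.
import torch
import torch.nn as nn

EMB_DIM = 1024   # SONAR-style fixed-size sentence embedding (assumed)
N_STEPS = 50     # number of denoising iterations (assumed)

class Denoiser(nn.Module):
    """Transformer that predicts a cleaner next-sentence embedding from a noisy one,
    conditioned on the embeddings of the preceding sentences (the context)."""
    def __init__(self, dim: int = EMB_DIM, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.step_emb = nn.Embedding(N_STEPS, dim)

    def forward(self, noisy_next: torch.Tensor, context: torch.Tensor, step: torch.Tensor):
        # Build one sequence: context sentences, a step embedding, then the noisy target.
        # The last position is read out as the denoised estimate.
        step_tok = self.step_emb(step).unsqueeze(1)                       # (B, 1, D)
        seq = torch.cat([context, step_tok, noisy_next.unsqueeze(1)], dim=1)
        return self.encoder(seq)[:, -1]                                   # (B, D)

def generate_next_embedding(model: Denoiser, context: torch.Tensor) -> torch.Tensor:
    """Start from pure noise and iteratively refine toward a plausible next-sentence embedding."""
    x = torch.randn(context.size(0), EMB_DIM)
    for t in reversed(range(N_STEPS)):
        step = torch.full((context.size(0),), t, dtype=torch.long)
        pred = model(x, context, step)
        # Simple interpolation toward the prediction stands in for a proper DDPM/DDIM update.
        x = x + (pred - x) / (t + 1)
    return x

model = Denoiser()
context = torch.randn(2, 3, EMB_DIM)   # batch of 2 documents, 3 preceding sentence embeddings each
next_sentence_embedding = generate_next_embedding(model, context)
print(next_sentence_embedding.shape)   # torch.Size([2, 1024])
```

In a full system, the resulting embedding would be passed to a decoder such as SONAR's to realize the predicted concept as text in whichever language is requested.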