Microsoft Researchers Introduce a Novel and Simple Artificial Intelligence Method for High-Quality Text Embeddings Using Only Synthetic Data

Natural Language Processing (NLP) tasks make extensive use of text embeddings. Text embeddings are vector representations of natural language that encode its semantic information. Tasks such as information retrieval, question answering, semantic textual similarity, bitext mining, and item recommendation all rely on these embeddings. In information retrieval (IR), text embeddings combined with methods like approximate nearest neighbor (ANN) search efficiently retrieve a small set of candidate documents from a large corpus in the first retrieval stage.
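To make that first-stage retrieval step concrete, here is a minimal sketch using exhaustive cosine-similarity scoring. The random vectors are placeholders for embeddings produced by a real text encoder, and a production system would replace the exact scan with an ANN index such as FAISS or HNSW.

```python
import numpy as np

# Toy corpus embeddings; in practice these come from a text encoder.
corpus_embeddings = np.random.rand(10_000, 384).astype(np.float32)
query_embedding = np.random.rand(384).astype(np.float32)

# Normalize so that the dot product equals cosine similarity.
corpus_embeddings /= np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
query_embedding /= np.linalg.norm(query_embedding)

# First-stage retrieval: score every document and keep the top-k candidates.
# ANN search approximates this scan to scale to very large corpora.
scores = corpus_embeddings @ query_embedding
top_k = np.argsort(-scores)[:5]
print("Top candidate document ids:", top_k)
```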

Retrieval Augmented Generation (RAG), a recent paradigm that allows Large Language Models (LLMs) to access dynamic external knowledge without modifying their parameters, likewise relies heavily on embedding-based retrieval. Text embeddings also play a crucial role in attributing generated text to its sources, improving the interpretability and trustworthiness of LLMs.
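The retrieve-then-generate pattern behind RAG can be summarized in a few lines. In this sketch, `embed` and `llm_generate` are hypothetical stand-ins for a text encoder and an LLM call; they are not part of the paper's method.

```python
import numpy as np

def rag_answer(question, corpus, corpus_vectors, embed, llm_generate, k=3):
    """Retrieve the k most relevant passages, then condition the LLM on them."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    vecs = corpus_vectors / np.linalg.norm(corpus_vectors, axis=1, keepdims=True)
    top_k = np.argsort(-(vecs @ q))[:k]
    context = "\n".join(corpus[i] for i in top_k)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)
```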

Prior research has shown that weighted averages of pre-trained word embeddings provide a reliable baseline for gauging semantic similarity. These techniques, however, cannot fully capture the rich contextual information present in natural language.
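A minimal sketch of such frequency-weighted averaging, in the spirit of smooth inverse frequency (SIF) weighting, is shown below; the vectors and frequencies are toy placeholders for real pre-trained embeddings and corpus statistics.

```python
import numpy as np

# Toy pre-trained word vectors and unigram frequencies (placeholders).
rng = np.random.default_rng(0)
word_vecs = {w: rng.random(50) for w in ["cats", "sit", "on", "mats"]}
word_freq = {"cats": 1e-4, "sit": 1e-4, "on": 1e-2, "mats": 1e-5}

def sentence_embedding(tokens, a=1e-3):
    # Frequent function words (e.g., "on") are down-weighted,
    # so content words dominate the sentence vector.
    weights = np.array([a / (a + word_freq[t]) for t in tokens])
    vecs = np.stack([word_vecs[t] for t in tokens])
    return weights @ vecs / weights.sum()

emb = sentence_embedding(["cats", "sit", "on", "mats"])
```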

With the introduction of pre-trained language models, methods such as Sentence-BERT and SimCSE emerged. These methods fine-tune models like BERT on Natural Language Inference (NLI) datasets to learn text embeddings. State-of-the-art techniques such as E5 and BGE use more sophisticated multi-stage training paradigms, pre-training on weakly-supervised text pairs and then fine-tuning on labeled datasets to improve robustness and performance.
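For illustration, a Sentence-BERT-style encoder can be used off the shelf via the sentence-transformers library; the checkpoint name below is a common default, not a model from this paper.

```python
from sentence_transformers import SentenceTransformer, util

# Any Sentence-BERT-style checkpoint works here.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["A man is playing a guitar.", "Someone plays an instrument."]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]))
```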

In recent research, a team from Microsoft Corporation has presented a novel and simple method for producing high-quality text embeddings. The approach achieves remarkable results using only synthetic data and fewer than 1,000 training steps. This contrasts with existing methods that rely on multi-stage pre-training with billions of weakly-supervised text pairs followed by fine-tuning on limited labeled datasets. The key difference is that it does not depend on labor-intensive training pipelines and manually gathered datasets, which frequently suffer from limited task diversity and language coverage.
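A sketch of how such synthetic training data might be requested from a proprietary LLM is shown below. The prompt wording and JSON fields are illustrative assumptions, not the paper's actual templates or task taxonomy; the call itself uses the standard openai Python client.

```python
import json
from openai import OpenAI  # assumes the openai package and an API key are set up

client = OpenAI()

# Illustrative prompt only; the paper uses its own templates and task taxonomy.
prompt = (
    "Brainstorm a retrieval task, then generate one JSON example for it with "
    'fields "user_query", "positive_document", and "hard_negative_document". '
    "Write everything in French."  # vary the language to cover many languages
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
# Assumes the model returns valid JSON; real pipelines would validate/retry.
example = json.loads(response.choices[0].message.content)
```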

The method uses proprietary Large Language Models to generate diverse synthetic data for text embedding tasks across nearly 100 languages. Instead of relying on complex multi-stage pre-training, the approach fine-tunes open-source decoder-only LLMs on the generated synthetic data with a standard contrastive loss.
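A standard in-batch contrastive objective of this kind is the InfoNCE loss; here is a minimal PyTorch sketch, with the temperature value chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb, passage_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss.

    query_emb, passage_emb: (batch, dim) tensors; row i of each forms a
    positive pair, and all other rows in the batch serve as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for model outputs.
loss = infonce_loss(torch.randn(8, 768), torch.randn(8, 768))
```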

The team conducted experiments to validate this approach. Without using any labeled data, the model demonstrates strong results on highly competitive text embedding benchmarks. When fine-tuned on a mixture of synthetic and labeled data, it sets new state-of-the-art results on the BEIR and MTEB benchmarks, establishing itself as a leading text embedding method that does not require large labeled datasets.
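Benchmarks like MTEB can be run with the open-source mteb library against any model that exposes an `encode` method; the checkpoint and single-task choice below are placeholders to keep the example small.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model with an `encode(sentences)` method can be plugged in;
# the checkpoint name here is a placeholder, not the paper's model.
model = SentenceTransformer("intfloat/e5-base-v2")
evaluation = MTEB(tasks=["STSBenchmark"])  # a single task keeps the run small
results = evaluation.run(model, output_folder="results")
```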

Proprietary LLMs like GPT-4 have been used to produce a diverse range of synthetic data that includes multilingual instructions. On the highly competitive MTEB benchmark, the method achieves remarkable performance in nearly all task categories by leveraging the strong language understanding capabilities of the Mistral model.
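A decoder-only LLM such as Mistral can yield a text embedding by taking the hidden state of the final token of an instruction-prefixed input; the sketch below shows this pattern with the transformers library, where the instruction template is an illustrative assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Last-token pooling from a decoder-only LLM; the instruction wording is illustrative.
name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.float16)

text = "Instruct: Retrieve relevant passages.\nQuery: how do text embeddings work?"
batch = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (1, seq_len, dim)
embedding = hidden[0, -1]                      # hidden state of the final token
embedding = torch.nn.functional.normalize(embedding, dim=-1)
```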

In conclusion, this study shows that LLMs can significantly improve the quality of text embeddings. The training procedure eliminates the need for intermediate pre-training and is considerably more streamlined and efficient than existing multi-stage pipelines.


Check out the Paper. All credit for this research goes to the researchers of this project.



Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.




