Training Improved Text Embeddings with Large Language Models

On Jan 11, 2024

Text embeddings are vector representations of words, sentences, paragraphs or documents that capture their semantic meaning. They serve as a core building block in many natural language processing (NLP) applications today, including information retrieval, question answering, semantic search and more.
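
To make the idea concrete, here is a minimal sketch of how embeddings support semantic search: texts are compared by the cosine similarity of their vectors. The three-dimensional vectors and the helper names (`cosine_similarity`, `rank_by_similarity`) are hypothetical and chosen only for illustration; a real model would produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not the output of any real model.
toy_embeddings = {
    "how to train a puppy": [0.9, 0.1, 0.0],
    "dog obedience tips": [0.8, 0.2, 0.1],
    "quarterly tax filing": [0.0, 0.1, 0.9],
}

def rank_by_similarity(query, embeddings):
    """Return the other texts sorted from most to least similar to `query`."""
    q = embeddings[query]
    others = [text for text in embeddings if text != query]
    return sorted(
        others,
        key=lambda text: cosine_similarity(q, embeddings[text]),
        reverse=True,
    )

ranking = rank_by_similarity("how to train a puppy", toy_embeddings)
print(ranking)
```

With these toy vectors, the dog-related text ranks above the unrelated tax text because its vector points in nearly the same direction as the query's.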


Recent large language models (LLMs) like GPT-3 have shown impressive capabilities in few-shot learning and natural language generation. Can LLMs also advance the state of text embeddings? In their paper “Improving Text Embeddings with Large Language Models”, researchers from Microsoft propose a method that generates synthetic training data with LLMs and fine-tunes an embedding model on it, achieving state-of-the-art benchmark results.

Challenges with Existing Methods

Traditional text embedding techniques like weighted averages of word vectors or TF-IDF fail to adequately capture the rich contextual information in text. More recent methods based on pre-trained language models like BERT obtain much better context-aware embeddings.
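
A small sketch shows one way averaged word vectors lose context: word order is discarded, so two sentences with opposite meanings can receive identical embeddings. The two-dimensional word vectors and the `average_embedding` helper are hypothetical. The average here is unweighted, but a per-word weighting such as TF-IDF would behave the same way, since the weights do not depend on position.

```python
# Hypothetical 2-dimensional word vectors, chosen only for illustration.
word_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def average_embedding(sentence, vectors):
    """Embed a sentence as the unweighted mean of its word vectors."""
    words = sentence.split()
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]

a = average_embedding("dog bites man", word_vectors)
b = average_embedding("man bites dog", word_vectors)
print(a == b)  # the two sentences are indistinguishable after averaging
```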

However, they require complex multi-stage training pipelines:

Pre-train on billions of weakly labeled or artificial text pairs
Fine-tune on limited hand-curated datasets

This demands massive compute resources and human effort for data collection. The training data is also constrained in diversity and language coverage. For instance, the BEIR benchmark comprises datasets for only 15 retrieval tasks in English.

Existing methods predominantly use smaller BERT-style architectures as the backbone model. They are unable to take advantage of more advanced LLMs and related techniques.

Methodology: Synthetic Data Generation with LLMs

To overcome these limitations, the researchers propose a single-stage training approach that uses proprietary LLMs such as GPT-3.5-Turbo and GPT-4 to generate diverse synthetic training data.

The key steps are:

Task Taxonomy: Define a taxonomy that categorizes text embedding tasks into:
- Asymmetric tasks (query and document are semantically related but not paraphrases, e.g. search)
- Symmetric tasks (query and document are paraphrases of each other, e.g. semantic similarity)
Prompt Design: Create prompt templates tailored to each task type that guide the LLM to generate relevant training examples.
Synthetic Data Generation: Prompt the LLM with the designed prompts to generate hundreds of thousands of (query, document) pairs covering a wide variety of semantic tasks across 93 languages.
Model Training: Fine-tune a powerful open-source LLM such as Mistral on the synthetic data using contrastive loss.
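
The final step relies on a contrastive loss. The summary above does not say which variant is used, so the sketch below assumes a common form, InfoNCE with in-batch negatives: each query should score its own document higher than every other document in the batch. The function names and the temperature value are illustrative assumptions, and the code uses plain Python lists rather than a deep learning framework.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss, where doc_vecs[i] is the positive for query_vecs[i]
    and every other document in the batch serves as a negative.
    The temperature is an assumed value for illustration."""
    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, [[1.0, 0.1], [0.1, 1.0]])
shuffled = info_nce_loss(queries, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

The loss is close to zero when each query sits next to its own document and grows when the pairing is scrambled, which is the signal fine-tuning uses to pull matching pairs together.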

This methodology allows creating ample training data for diverse tasks in multiple languages without any human labeling effort. By leveraging the knowledge already embedded in LLMs through pre-training on web-scale corpora, we can synthesize high-quality data precisely tailored for text embeddings.

The researchers demonstrate this with a 2-step prompting strategy:

Prompt GPT-4 to brainstorm a list of potential retrieval tasks
Prompt it again to generate (query, document) samples for each suggested task, each returned as a (query, positive, hard negative) triplet
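
The two prompting steps can be sketched as follows. This is a rough outline under stated assumptions, not the authors' pipeline: `fake_llm` is a hypothetical stand-in that returns canned JSON so the code runs offline, and the prompt wording and JSON keys are invented for illustration.

```python
import json

def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call; returns canned JSON."""
    if prompt.startswith("Brainstorm"):
        return json.dumps(["Find the manual section that answers an appliance question"])
    return json.dumps({
        "user_query": "how do I descale an espresso machine",
        "positive_document": "Step-by-step descaling instructions for an espresso machine.",
        "hard_negative_document": "A buying guide comparing espresso machines by price.",
    })

def brainstorm_tasks(llm):
    """Step 1: ask the model to propose retrieval tasks."""
    prompt = "Brainstorm a list of useful text retrieval tasks. Return a JSON list of strings."
    return json.loads(llm(prompt))

def generate_triplet(llm, task):
    """Step 2: ask the model for one (query, positive, hard negative) example."""
    prompt = (
        f"You have been assigned a retrieval task: {task}\n"
        "Return one JSON object with the keys user_query, "
        "positive_document and hard_negative_document."
    )
    example = json.loads(llm(prompt))
    return (
        example["user_query"],
        example["positive_document"],
        example["hard_negative_document"],
    )

triplets = [generate_triplet(fake_llm, task) for task in brainstorm_tasks(fake_llm)]
print(triplets[0][0])
```

Swapping `fake_llm` for a function that calls a real model would leave the rest of the structure unchanged, which is why the model call is passed in as an argument.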

Some key aspects of the prompt design:

Natural language prompts for intuitive human-like instructions
Placeholders to encourage diversity (e.g. query length, clarity, document length)
Combining data from multiple templates for the same task type
Weighting languages based on resource availability
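
As an illustration of the placeholder and language-weighting ideas, the sketch below fills a prompt template with randomly sampled values. The template text, the placeholder options and the language weights are all assumptions made up for this example; they are not taken from the paper.

```python
import random

# Hypothetical template; the placeholder names are illustrative.
TEMPLATE = (
    "Task: {task}\n"
    "Query length: {query_length}\n"
    "Query clarity: {clarity}\n"
    "Document length: about {num_words} words\n"
    "Language: {language}\n"
    "Generate one (query, document) pair that satisfies these constraints."
)

# Assumed option lists; sampling from them varies each generated prompt.
PLACEHOLDER_VALUES = {
    "query_length": ["less than 5 words", "5 to 15 words", "at least 10 words"],
    "clarity": ["clear", "understandable with some effort", "ambiguous"],
    "num_words": [50, 100, 200, 300],
}

# Assumed weights: higher-resource languages are sampled more often.
LANGUAGE_WEIGHTS = {"English": 0.5, "Polish": 0.2, "Japanese": 0.2, "Swahili": 0.1}

def build_prompt(task, rng):
    """Fill the template with randomly sampled placeholder values."""
    values = {name: rng.choice(options) for name, options in PLACEHOLDER_VALUES.items()}
    languages = list(LANGUAGE_WEIGHTS)
    weights = [LANGUAGE_WEIGHTS[lang] for lang in languages]
    values["language"] = rng.choices(languages, weights=weights, k=1)[0]
    return TEMPLATE.format(task=task, **values)

rng = random.Random(0)  # fixed seed so repeated runs produce the same prompts
prompts = [build_prompt("Find a recipe given a list of ingredients", rng) for _ in range(3)]
print(prompts[0])
```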

In total, they were able to generate 500k text embedding examples at a compute cost of 180M tokens. The dominant language was English (43%) followed by Polish, Japanese, Italian and others.
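
A quick back-of-the-envelope calculation puts these figures in perspective. It uses only the numbers quoted above and assumes the 180M total covers everything consumed to produce the 500k examples.

```python
total_tokens = 180_000_000   # reported token consumption
num_examples = 500_000       # reported number of generated examples
english_share = 0.43         # reported share of English examples

tokens_per_example = total_tokens / num_examples
english_examples = round(english_share * num_examples)

print(tokens_per_example)  # 360.0 tokens per example on average
print(english_examples)    # 215000 English examples
```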

For model training, they fine-tuned the open-source 7B-parameter Mistral model instead of a smaller BERT-style architecture. Because Mistral was already pre-trained on massive text corpora, an additional contrastive pre-training stage turned out to be unnecessary: adding one yielded only negligible improvements.

The entire fine-tuning run took fewer than 1k steps, using a mix of synthetic and human-labeled data. This demonstrates the training efficiency of the proposed approach.

Results

The researchers evaluated their model on the MTEB benchmark, which covers diverse tasks across classification, clustering, semantic similarity, summarization and information retrieval.

Their model outperformed previous state-of-the-art by 2.4 points in average score, establishing new records for nearly every category:

Task category	Previous SOTA	Proposed Model
Classification	76.0	78.5
Clustering	46.1	50.3
Pairwise Classification	87.1	88.3
Reranking	60.0	60.2
Retrieval	54.3	56.9
STS	83.1	84.6
Summarization	31.6	31.4
Average	64.2	66.6
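
The per-category differences can be checked directly from the table. The short script below copies the scores as listed and computes each gain, which also shows the one category where the proposed model does not come out ahead.

```python
# (previous SOTA, proposed model) scores copied from the table above.
scores = {
    "Classification": (76.0, 78.5),
    "Clustering": (46.1, 50.3),
    "Pairwise Classification": (87.1, 88.3),
    "Reranking": (60.0, 60.2),
    "Retrieval": (54.3, 56.9),
    "STS": (83.1, 84.6),
    "Summarization": (31.6, 31.4),
    "Average": (64.2, 66.6),
}

# Round to one decimal place to match the precision of the table.
deltas = {name: round(new - old, 1) for name, (old, new) in scores.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print("Categories without a gain:", regressions)
```

The average gain is 2.4 points, the largest gain is in clustering (4.2 points), and summarization is the only category with a slight drop (0.2 points).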

Remarkably, a variant trained solely on synthetic data, without any labeled data, remained competitive: it scored only 3.5 points behind the fully supervised model. This demonstrates the viability of training text embedding models on LLM-generated data alone, without human annotation effort.

The researchers also evaluated their model on the multilingual MIRACL benchmark covering 18 languages. It outperformed the previous best on high-resource languages but was weaker on low-resource ones. They hypothesize that this gap could be narrowed by pre-training LLMs more extensively on low-resource languages.

In summary, text embeddings trained on LLM-generated synthetic data establish new state-of-the-art results, while using simpler and more efficient training than prior multi-stage approaches. With further research into prompt engineering and synthetic data quality, this methodology could greatly advance multilingual text embeddings.

Analysis

This work offers several valuable takeaways:

LLMs like GPT-3 and GPT-4 have an impressive ability to generate high-quality synthetic training data for diverse NLP tasks when prompted appropriately. This can reduce reliance on human-labeled data.
For text embeddings, contrastive pre-training provides negligible gains over just fine-tuning models like Mistral that already have trillion-scale pre-training. This is an important insight into training efficiency.
Retrieval augmented generation methods are enabling LLMs to dynamically access external knowledge. Hence improving text embeddings is valuable for enhancing these LLMs.
There is significant room for improvement in low-resource languages. Multilingual LLMs pre-trained on more representative data could help close this gap.
Conceptually, language modeling and text embeddings are two sides of the same coin – understanding language semantics. With synthetic data prompting, LLMs can be organically fine-tuned into embedders without complex pipelines.

Some promising directions for future work include:

Leveraging open-source LLMs like GPT-NeoX to generate synthetic data
Exploring lightweight post-training to adapt embedders to longer contexts
Development of prompt engineering techniques to control quality and task coverage
Methods to improve inference latency and storage costs for industrial usage

Beyond beating benchmarks, employing large language models to enhance text embeddings opens up intriguing possibilities for the future. As LLMs continue to advance in their mastery over natural language, their aptitude for generating high-fidelity synthetic data is likely to improve as well.

However, critical research directions remain to translate this potential into real-world impact.

Customization and Control

A key benefit of synthetic data is the ability to programmatically generate examples tailored to specific needs. As the paper demonstrated, prompt engineering allows creating training data for hundreds of thousands of embedding tasks.

Yet current prompt design practices remain more art than science. Developing systematic, reproducible methods to precisely control the properties of generated data would expand the applicability of this technique.

For instance, techniques to modulate factors like the complexity, ambiguity and novelty of examples could help address robustness issues in downstream tasks. Dynamic prompt generation to match evolving real-world distributions is another open challenge.

Training at Scale

While pre-trained LLMs already encode substantial linguistic knowledge, their data generation skills are likely to improve further with additional scale. Models like GPT-4, trained on trillions of tokens of internet text, exhibit strong few-shot learning but have not been optimized specifically for synthesizing training data.

Architectures and objectives tailored to bootstrapping self-supervised data generation at web-scale could substantially advance the quality and efficiency of this methodology. Efficient integration of retrieved knowledge to complement learned knowledge is another promising direction.

Multitask and Multilingual

As the paper noted, performance on low-resource languages remains an open issue. Rather than pre-training a single massive LLM, an alternative is to train a fleet of smaller expert models that specialize in particular data modalities or language domains.

Such an ensemble approach could help improve coverage over rare tasks and languages by sharing representations learned across experts. Continual learning to expand language and task expertise over time is also an exciting prospect.

In conclusion, this paper introduces the idea of synthesizing training data with LLMs to build high-performing text embedding models. The results demonstrate the effectiveness of this methodology, with the model outperforming the previous state of the art on standard benchmarks. As LLMs and synthetic data techniques progress, tapping into their knowledge to train embedders looks like a highly promising direction.
