Introduction to SPLADE with FastEmbed
In this notebook, we will explore how to generate sparse vectors, in particular with a variant of SPLADE.
> 💡 The original naver/SPLADE models were licensed under CC BY-NC-SA 4.0, which does not permit commercial use. This SPLADE++ model is released under the Apache License and can therefore be used commercially.
Outline:
- What is SPLADE?
- Setting up the environment
- Generating SPLADE vectors with FastEmbed
- Understanding SPLADE vectors
- Observations and Design Choices
What is SPLADE?
SPLADE is a method for learning sparse vector representations of text. It has been reported to outperform BM25, the ranking approach underlying the Elastic/Lucene family of implementations, on standard retrieval benchmarks. This makes it highly effective for tasks such as information retrieval, document classification, and more.
The key advantage of SPLADE is that it generates sparse vectors. Only the non-zero entries need to be stored, and each dimension corresponds to a token in the model's vocabulary, so sparse vectors tend to be cheaper to store and search and easier to interpret than dense vectors. This makes SPLADE a powerful tool for handling large-scale text data.
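To make the efficiency point concrete, here is a minimal sketch in plain Python. It stores a sparse vector as a mapping from vocabulary index to weight and scores two vectors with a dot product that only touches shared indices. The vocabulary size, indices, and weights are invented for illustration; they are not output from a SPLADE model, and `sparse_dot` and `to_dense` are hypothetical helpers rather than FastEmbed functions.

```python
from typing import Dict, List

# Illustrative only: the numbers below are invented, not model output.
VOCAB_SIZE = 30_000  # assumed vocabulary size for this sketch

# Sparse form: keep only the non-zero entries as {vocabulary index: weight}.
doc_vec: Dict[int, float] = {1017: 0.8, 2001: 1.5, 4231: 2.1}
query_vec: Dict[int, float] = {2001: 1.0, 4231: 0.5, 9999: 0.7}


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product that iterates over the smaller vector only."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


def to_dense(vec: Dict[int, float], size: int) -> List[float]:
    """Expand a sparse vector into a dense list that is mostly zeros."""
    dense = [0.0] * size
    for idx, weight in vec.items():
        dense[idx] = weight
    return dense


score = sparse_dot(doc_vec, query_vec)
dense_score = sum(
    x * y
    for x, y in zip(to_dense(doc_vec, VOCAB_SIZE), to_dense(query_vec, VOCAB_SIZE))
)

print(f"stored entries: {len(doc_vec)} sparse vs {VOCAB_SIZE} dense")
print(f"sparse score: {score:.2f}, dense score: {dense_score:.2f}")
```

Both scores agree, but the sparse version touches three entries instead of thirty thousand. Because every index refers to one vocabulary token, each contribution to the score can also be traced back to a specific token.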
Setting up the environment
This notebook uses a few dependencies, which are installed below (transformers is only needed for the tokenizer lookup later on):
# !pip install -q fastembed transformers
Let's get started! 🚀
from fastembed import SparseTextEmbedding, SparseEmbedding
from typing import List
> You can find the list of all supported Sparse Embedding models by calling this API: SparseTextEmbedding.list_supported_models()
SparseTextEmbedding.list_supported_models()
model_name = "prithvida/Splade_PP_en_v1"
# This triggers the model download
model = SparseTextEmbedding(model_name=model_name)
documents: List[str] = [
    "Chandrayaan-3 is India's third lunar mission",
    "It aimed to land a rover on the Moon's surface - joining the US, China and Russia",
    "The mission is a follow-up to Chandrayaan-2, which had partial success",
    "Chandrayaan-3 will be launched by the Indian Space Research Organisation (ISRO)",
    "The estimated cost of the mission is around $35 million",
    "It will carry instruments to study the lunar surface and atmosphere",
    "Chandrayaan-3 landed on the Moon's surface on 23rd August 2023",
    "It consists of a lander named Vikram and a rover named Pragyan similar to Chandrayaan-2. Its propulsion module would act like an orbiter.",
    "The propulsion module carries the lander and rover configuration until the spacecraft is in a 100-kilometre (62 mi) lunar orbit",
    "The mission used GSLV Mk III rocket for its launch",
    "Chandrayaan-3 was launched from the Satish Dhawan Space Centre in Sriharikota",
    "Chandrayaan-3 was launched earlier in the year 2023",
]
sparse_embeddings_list: List[SparseEmbedding] = list(
    model.embed(documents, batch_size=6)
)  # batch_size is optional, notice the generator
index = 0
sparse_embeddings_list[index]
The previous output is a SparseEmbedding object for the first document in our list.
It contains two arrays: values and indices.
- The 'values' array represents the weights of the features (tokens) in the document.
- The 'indices' array represents the indices of these features in the model's vocabulary.
Each pair of corresponding values and indices represents a token and its weight in the document.
# Let's print the first 5 features and their weights for better understanding.
for i in range(5):
    print(f"Token at index {sparse_embeddings_list[0].indices[i]} has weight {sparse_embeddings_list[0].values[i]}")
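Because the two arrays are aligned position by position, they can be zipped into (index, weight) pairs and ranked. The sketch below uses plain Python lists with invented numbers in place of a real SparseEmbedding, so it runs without downloading a model. `top_k_features` is a hypothetical helper, not part of FastEmbed.

```python
from typing import List, Tuple

# Stand-ins for an embedding's indices and values arrays (invented numbers).
indices = [1017, 2001, 3260, 4231, 7832]
values = [0.05, 0.02, 1.91, 2.74, 0.33]


def top_k_features(indices: List[int], values: List[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (index, weight) pairs with the largest weights."""
    pairs = zip(indices, values)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


top3 = top_k_features(indices, values, k=3)
for idx, weight in top3:
    print(f"vocabulary index {idx}: weight {weight}")
```

The same idea should carry over to a real embedding by passing its indices and values in place of the stand-in lists.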
Understanding SPLADE vectors
This is still a little abstract, so let's use the tokenizer vocab to make sense of these indices.
import json
from transformers import AutoTokenizer
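# Assumes the first supported model is the SPLADE model loaded above;
# its "hf" source is used to fetch the matching tokenizer.
tokenizer = AutoTokenizer.from_pretrained(SparseTextEmbedding.list_supported_models()[0]["sources"]["hf"])


def get_tokens_and_weights(sparse_embedding, tokenizer):
    """Return {decoded token: weight}, sorted by descending weight."""
    token_weight_dict = {}
    for i in range(len(sparse_embedding.indices)):
        token = tokenizer.decode([int(sparse_embedding.indices[i])])
        # Cast to a plain Python float so json.dumps can serialize the value
        weight = float(sparse_embedding.values[i])
        token_weight_dict[token] = weight

    # Sort the dictionary by weights
    token_weight_dict = dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))
    return token_weight_dict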
# Test the function with the first SparseEmbedding
print(json.dumps(get_tokens_and_weights(sparse_embeddings_list[index], tokenizer), indent=4))
Observations and Design Choices
- The relative ordering of the weights is useful: the most important tokens in the sentence receive the highest weights.
- Term expansion: the model can assign weights to tokens that do not appear in the document but are related to tokens that do. This lets the sparse vector capture more of the document's context. Here, you'll see that the model has added the tokens '3' from 'third' and 'moon' from 'lunar' to the sparse vector.
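One rough way to spot term expansion is to compare the weighted tokens against the words that literally occur in the text. The sketch below does this with a simple word-level match. The token weights are hypothetical values invented for illustration, not real model output, and `split_expansion` is a made-up helper name.

```python
import re
from typing import Dict, Tuple

document = "Chandrayaan-3 is India's third lunar mission"

# Hypothetical token weights, invented for illustration (not real model output).
token_weights = {
    "india": 2.3,
    "lunar": 2.1,
    "third": 2.0,
    "mission": 1.9,
    "moon": 1.6,
    "indian": 0.9,
}


def split_expansion(text: str, weights: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Separate tokens that literally occur in the text from expansion tokens."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    present = {tok: w for tok, w in weights.items() if tok in words}
    expanded = {tok: w for tok, w in weights.items() if tok not in words}
    return present, expanded


present, expanded = split_expansion(document, token_weights)
print("in document:", sorted(present))
print("expanded:   ", sorted(expanded))
```

Word-level matching is a simplification: subword pieces produced by the tokenizer would need tokenizer-aware matching to be classified correctly.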
Design Choices
- The weights are not normalized, so they do not sum to 1 or 100. This is common practice in sparse embeddings, as it lets each weight directly express the importance of its token in the document.
- Every entry in the sparse vector corresponds to a token in the model's vocabulary. The model cannot assign a weight to a token that is not in that vocabulary.
- Tokens do not map directly to words, which lets the model handle typos and out-of-vocabulary words gracefully, since an unfamiliar word is split into known subword tokens.
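If a downstream component happens to expect unit-length vectors, the raw weights can be rescaled after the fact. This is an assumption about a possible use case, not something the notebook requires. The sketch below applies L2 normalization to hypothetical token weights (invented numbers, not model output); `l2_normalize` is a made-up helper name.

```python
import math
from typing import Dict

# Hypothetical token weights, invented for illustration (not real model output).
weights = {"india": 2.3, "lunar": 2.1, "third": 2.0, "mission": 1.9, "moon": 1.6}


def l2_normalize(vec: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so the vector has unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # nothing to scale in an all-zero vector
    return {tok: w / norm for tok, w in vec.items()}


normalized = l2_normalize(weights)
print(f"raw sum of weights: {sum(weights.values()):.2f}")
print(f"squared length after normalization: {sum(w * w for w in normalized.values()):.2f}")
```

Rescaling divides every weight by the same constant, so the relative order of tokens within a document is unchanged. It does change how documents compare with each other under a dot product, so whether to normalize depends on the scoring you intend to use.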