Introduction to SPLADE with FastEmbed

In this notebook, we will explore how to generate Sparse Vectors -- in particular a variant of the SPLADE.

> 💡 The original naver/SPLADE models were licensed CC BY-NC-SA 4.0 -- Not for Commercial Use. This SPLADE++ model is Apache License and hence, licensed for commercial use.

What is SPLADE?

SPLADE was a novel method for learning sparse vectors for text representation. This model beats BM25 -- the underlying approach for the Elastic/Lucene family of implementations. Thus making it highly effective for tasks such as information retrieval, document classification, and more.

The key advantage of SPLADE is its ability to generate sparse vectors, which are more efficient and interpretable than dense vectors. This makes SPLADE a powerful tool for handling large-scale text data.

Setting up the environment

This notebook uses few dependencies, which are installed below:

# !pip install -q fastembed

Let's get started! 🚀

from fastembed import SparseTextEmbedding, SparseEmbedding
from typing import List

> You can find the list of all supported Sparse Embedding models by calling this API: SparseTextEmbedding.list_supported_models()

SparseTextEmbedding.list_supported_models()

[{'model': 'prithvida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Misspelled version of the model. Retained for backward compatibility. Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}},
 {'model': 'prithivida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}}]

model_name = "prithvida/Splade_PP_en_v1"
# This triggers the model download
model = SparseTextEmbedding(model_name=model_name)

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

documents: List[str] = [
    "Chandrayaan-3 is India's third lunar mission",
    "It aimed to land a rover on the Moon's surface - joining the US, China and Russia",
    "The mission is a follow-up to Chandrayaan-2, which had partial success",
    "Chandrayaan-3 will be launched by the Indian Space Research Organisation (ISRO)",
    "The estimated cost of the mission is around $35 million",
    "It will carry instruments to study the lunar surface and atmosphere",
    "Chandrayaan-3 landed on the Moon's surface on 23rd August 2023",
    "It consists of a lander named Vikram and a rover named Pragyan similar to Chandrayaan-2. Its propulsion module would act like an orbiter.",
    "The propulsion module carries the lander and rover configuration until the spacecraft is in a 100-kilometre (62 mi) lunar orbit",
    "The mission used GSLV Mk III rocket for its launch",
    "Chandrayaan-3 was launched from the Satish Dhawan Space Centre in Sriharikota",
    "Chandrayaan-3 was launched earlier in the year 2023",
]
sparse_embeddings_list: List[SparseEmbedding] = list(
    model.embed(documents, batch_size=6)
)  # batch_size is optional, notice the generator

index = 0
sparse_embeddings_list[index]

SparseEmbedding(values=array([0.05297208, 0.01963477, 0.36459631, 1.38508618, 0.71776593,
       0.12667948, 0.46230844, 0.446771  , 0.26897505, 1.01519883,
       1.5655334 , 0.29412213, 1.53102326, 0.59785569, 1.1001817 ,
       0.02079751, 0.09955651, 0.44249091, 0.09747757, 1.53519952,
       1.36765671, 0.15740395, 0.49882549, 0.38629025, 0.76612782,
       1.25805044, 0.39058095, 0.27236196, 0.45152301, 0.48262018,
       0.26085234, 1.35912788, 0.70710695, 1.71639752]), indices=array([ 1010,  1011,  1016,  1017,  2001,  2018,  2034,  2093,  2117,
        2319,  2353,  2509,  2634,  2686,  2796,  2817,  2922,  2959,
        3003,  3148,  3260,  3390,  3462,  3523,  3822,  4231,  4316,
        4774,  5590,  5871,  6416, 11926, 12076, 16469]))

The previous output is a SparseEmbedding object for the first document in our list.

It contains two arrays: values and indices. - The 'values' array represents the weights of the features (tokens) in the document. - The 'indices' array represents the indices of these features in the model's vocabulary.

Each pair of corresponding values and indices represents a token and its weight in the document.

# Let's print the first 5 features and their weights for better understanding.
for i in range(5):
    print(f"Token at index {sparse_embeddings_list[0].indices[i]} has weight {sparse_embeddings_list[0].values[i]}")

Token at index 1010 has weight 0.05297207832336426
Token at index 1011 has weight 0.01963476650416851
Token at index 1016 has weight 0.36459630727767944
Token at index 1017 has weight 1.385086178779602
Token at index 2001 has weight 0.7177659273147583

Understanding SPLADE vectors

This is still a little abstract, so let's use the tokenizer vocab to make sense of these indices.

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(SparseTextEmbedding.list_supported_models()[0]["sources"]["hf"])

def get_tokens_and_weights(sparse_embedding, tokenizer):
    token_weight_dict = {}
    for i in range(len(sparse_embedding.indices)):
        token = tokenizer.decode([sparse_embedding.indices[i]])
        weight = sparse_embedding.values[i]
        token_weight_dict[token] = weight

    # Sort the dictionary by weights
    token_weight_dict = dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))
    return token_weight_dict


# Test the function with the first SparseEmbedding
print(json.dumps(get_tokens_and_weights(sparse_embeddings_list[index], tokenizer), indent=4))

{
    "chandra": 1.7163975238800049,
    "third": 1.5655333995819092,
    "##ya": 1.535199522972107,
    "india": 1.5310232639312744,
    "3": 1.385086178779602,
    "mission": 1.3676567077636719,
    "lunar": 1.3591278791427612,
    "moon": 1.2580504417419434,
    "indian": 1.1001816987991333,
    "##an": 1.015198826789856,
    "3rd": 0.7661278247833252,
    "was": 0.7177659273147583,
    "spacecraft": 0.7071069478988647,
    "space": 0.5978556871414185,
    "flight": 0.4988254904747009,
    "satellite": 0.4826201796531677,
    "first": 0.46230843663215637,
    "expedition": 0.4515230059623718,
    "three": 0.4467709958553314,
    "fourth": 0.44249090552330017,
    "vehicle": 0.390580952167511,
    "iii": 0.3862902522087097,
    "2": 0.36459630727767944,
    "##3": 0.2941221296787262,
    "planet": 0.27236196398735046,
    "second": 0.26897504925727844,
    "missions": 0.2608523368835449,
    "launched": 0.15740394592285156,
    "had": 0.12667948007583618,
    "largest": 0.09955651313066483,
    "leader": 0.09747757017612457,
    ",": 0.05297207832336426,
    "study": 0.02079751156270504,
    "-": 0.01963476650416851
}

Observations and Model Design Choices

The relative order of importance is quite useful. The most important tokens in the sentence have the highest weights.
Term Expansion: The model can expand the terms in the document. This means that the model can generate weights for tokens that are not present in the document but are related to the tokens in the document. This is a powerful feature that allows the model to capture the context of the document. Here, you'll see that the model has added the tokens '3' from 'third' and 'moon' from 'lunar' to the sparse vector.

Design Choices

The weights are not normalized. This means that the sum of the weights is not 1 or 100. This is a common practice in sparse embeddings, as it allows the model to capture the importance of each token in the document.
Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training.
Tokens do not map to words directly -- allowing you to gracefully handle typo errors and out-of-vocabulary tokens.