
Late Interaction Text Embedding Models

As of version 0.3.0, FastEmbed supports Late Interaction Text Embedding Models, currently with one of the most popular models of this family: ColBERT.

What is a Late Interaction Text Embedding Model?

A Late Interaction Text Embedding Model is a kind of information retrieval model that performs query-document interaction at the scoring stage. In order to better understand it, we can compare it to models without interaction.
For instance, if you take a sentence-transformer model, compute embeddings for your documents, compute embeddings for your queries, and just compare them by cosine similarity, then you're retrieving points without interaction.
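
As a quick illustration of this no-interaction setup, here is a minimal sketch using FastEmbed's dense TextEmbedding class (the model name "BAAI/bge-small-en-v1.5" is just one of its supported models, used here for illustration):

import numpy as np
from fastembed import TextEmbedding

# One pooled vector per text; queries and documents never see each other
dense_model = TextEmbedding("BAAI/bge-small-en-v1.5")

docs = ["ColBERT is a late interaction model.", "BM25 is a lexical retriever."]
doc_vecs = np.array(list(dense_model.embed(docs)))
query_vec = list(dense_model.embed(["What is ColBERT?"]))[0]

# Cosine similarity between the single query vector and each document vector
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(scores)  # one score per document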

This approach is simple and straightforward; however, we might be sacrificing some precision due to its simplicity. This is caused by several facts:

- there is no interaction between queries and documents, neither at the early stage (embedding generation) nor at the late stage (during scoring);
- we are trying to encapsulate all the document information in only one pooled embedding, and obviously, some information might be lost.

Late Interaction Text Embedding models try to address this by computing embeddings for each token in queries and documents, and then finding the most similar ones via a model-specific operation; e.g., ColBERT (Contextual Late Interaction over BERT) uses the MaxSim operation. With this approach we not only get a better representation of the documents, but also make queries and documents more aware of one another.

For more information on ColBERT and the MaxSim operation, you can check out this blog post by Jina AI.
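
To make MaxSim concrete, here is a toy NumPy sketch scoring a single query-document pair (random vectors stand in for real token embeddings; a batched version over multiple documents appears later on this page):

import numpy as np

# Toy token embeddings: 4 query tokens and 6 document tokens, dim 128
query_tokens = np.random.rand(4, 128)
document_tokens = np.random.rand(6, 128)

# Similarity of every query token to every document token, shape [4, 6]
similarities = query_tokens @ document_tokens.T

# MaxSim: keep each query token's best-matching document token,
# then sum over query tokens to get a single relevance score
score = similarities.max(axis=1).sum()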

ColBERT in FastEmbed

FastEmbed provides a simple way to use ColBERT, with an interface similar to the one it provides for TextEmbedding.

from fastembed import LateInteractionTextEmbedding

LateInteractionTextEmbedding.list_supported_models()

[{'model': 'colbert-ir/colbertv2.0',
  'dim': 128,
  'description': 'Late interaction model',
  'size_in_GB': 0.44,
  'sources': {'hf': 'colbert-ir/colbertv2.0'},
  'model_file': 'model.onnx'}]
embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")

documents = [
    "ColBERT is a late interaction text embedding model, however, there are also other models such as TwinBERT.",
    "On the contrary to the late interaction models, the early interaction models contains interaction steps at embedding generation process",
]
queries = [
    "Are there any other late interaction text embedding models except ColBERT?",
    "What is the difference between late interaction and early interaction text embedding models?",
]

NOTE: ColBERT computes query and document embeddings differently; make sure to use the corresponding methods.

document_embeddings = list(
    embedding_model.embed(documents)
)  # embed and query_embed return generators,
# which we need to evaluate by writing them to a list
query_embeddings = list(embedding_model.query_embed(queries))
document_embeddings[0].shape, query_embeddings[0].shape
((26, 128), (32, 128))

Don't worry about the query embeddings having the bigger shape in this case. The ColBERT authors recommend padding queries with [MASK] tokens up to 32 tokens. They also recommend truncating queries to 32 tokens; however, we don't do that in FastEmbed, so you can pass longer queries straight in.
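
You can observe this padding behavior by embedding a short and a long query and comparing shapes (a sketch; exact token counts depend on the tokenizer):

short_query = list(embedding_model.query_embed(["What is ColBERT?"]))[0]
long_query = list(embedding_model.query_embed([" ".join(["word"] * 50)]))[0]
print(short_query.shape)  # short queries are padded with [MASK] up to 32 tokens
print(long_query.shape)  # longer queries are not truncated, so more than 32 rows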

MaxSim operator

Qdrant will support ColBERT as of the next version (v1.10); at the moment, you can compute embedding similarities manually.

import numpy as np


def compute_relevance_scores(query_embedding: np.ndarray, document_embeddings: np.ndarray, k: int):
    """
    Compute relevance scores for top-k documents given a query.

    :param query_embedding: Numpy array representing the query embedding, shape: [num_query_terms, embedding_dim]
    :param document_embeddings: Numpy array representing embeddings for documents, shape: [num_documents, max_doc_length, embedding_dim]
    :param k: Number of top documents to return
    :return: Indices of the top-k documents based on their relevance scores
    """
    # Compute batch dot-product of query_embedding and document_embeddings
    # Resulting shape: [num_documents, num_query_terms, max_doc_length]
    scores = np.matmul(query_embedding, document_embeddings.transpose(0, 2, 1))

    # Apply max-pooling across document terms (axis=2) to find the max similarity per query term
    # Shape after max-pool: [num_documents, num_query_terms]
    max_scores_per_query_term = np.max(scores, axis=2)

    # Sum the scores across query terms to get the total score for each document
    # Shape after sum: [num_documents]
    total_scores = np.sum(max_scores_per_query_term, axis=1)

    # Sort the documents based on their total scores and get the indices of the top-k documents
    sorted_indices = np.argsort(total_scores)[::-1][:k]

    return sorted_indices
sorted_indices = compute_relevance_scores(
    np.array(query_embeddings[0]), np.array(document_embeddings), k=3
)
print("Sorted document indices:", sorted_indices)
Sorted document indices: [0 1]

print(f"Query: {queries[0]}")
for index in sorted_indices:
    print(f"Document: {documents[index]}")
Query: Are there any other late interaction text embedding models except ColBERT?
Document: ColBERT is a late interaction text embedding model, however, there are also other models such as TwinBERT.
Document: On the contrary to the late interaction models, the early interaction models contains interaction steps at embedding generation process

Use-case recommendation

Although ColBERT allows computing document embeddings independently and moving some of the workload offline, it still consumes more resources than no-interaction models. For this reason, it might be more reasonable to use ColBERT not as a first-stage retriever, but as a re-ranker.

The first-stage retriever would then be a no-interaction model, which retrieves, e.g., the first 100 or 500 candidates, and leaves the final ranking to the ColBERT model, as in the sketch below.
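
Here is a hypothetical sketch of such a two-stage pipeline, reusing the models and the compute_relevance_scores function from above (the dense model name is an assumption, and the tiny in-memory corpus stands in for a real vector database):

import numpy as np
from fastembed import TextEmbedding

# Stage 1: a no-interaction model retrieves candidates
dense_model = TextEmbedding("BAAI/bge-small-en-v1.5")
doc_vecs = np.array(list(dense_model.embed(documents)))
query_vec = list(dense_model.embed([queries[0]]))[0]
candidate_ids = np.argsort(doc_vecs @ query_vec)[::-1][:100]  # top candidates

# Stage 2: ColBERT re-ranks only the candidates with MaxSim
candidate_docs = [documents[i] for i in candidate_ids]
candidate_embeddings = np.array(list(embedding_model.embed(candidate_docs)))
query_embedding = list(embedding_model.query_embed([queries[0]]))[0]
reranked = compute_relevance_scores(query_embedding, candidate_embeddings, k=10)
print([candidate_docs[i] for i in reranked])

Note that, as in the example above, stacking document embeddings into a single array assumes documents of equal token length; in practice you would pad the per-document matrices.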