🤗 Huggingface vs ⚡ FastEmbed

Comparing the performance of Huggingface's 🤗 Transformers and ⚡ FastEmbed on a simple embedding task, on the following machine: Apple M2 Max, 32 GB RAM.

📦 Imports

Importing the necessary libraries for this comparison.

!pip install matplotlib transformers torch fastembed -qq
import time
from typing import Callable, List, Tuple

import matplotlib.pyplot as plt
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

from fastembed import TextEmbedding
import fastembed
fastembed.__version__
'0.2.6'

📖 Data

documents is a list of strings; each string is one document.

documents: List[str] = [
    "Chandrayaan-3 is India's third lunar mission",
    "It aimed to land a rover on the Moon's surface - joining the US, China and Russia",
    "The mission is a follow-up to Chandrayaan-2, which had partial success",
    "Chandrayaan-3 will be launched by the Indian Space Research Organisation (ISRO)",
    "The estimated cost of the mission is around $35 million",
    "It will carry instruments to study the lunar surface and atmosphere",
    "Chandrayaan-3 landed on the Moon's surface on 23rd August 2023",
    "It consists of a lander named Vikram and a rover named Pragyan similar to Chandrayaan-2. Its propulsion module would act like an orbiter.",
    "The propulsion module carries the lander and rover configuration until the spacecraft is in a 100-kilometre (62 mi) lunar orbit",
    "The mission used GSLV Mk III rocket for its launch",
    "Chandrayaan-3 was launched from the Satish Dhawan Space Centre in Sriharikota",
    "Chandrayaan-3 was launched earlier in the year 2023",
]
len(documents)
12

Setting up 🤗 Huggingface

We'll use Huggingface Transformers with PyTorch to generate embeddings, loading the same model in both libraries for a fair(er?) comparison.

class HF:
    """
    Huggingface Transformers implementation of the Flag Embedding model
    """

    def __init__(self, model_id: str):
        self.model = AutoModel.from_pretrained(model_id)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def embed(self, texts: List[str]):
        encoded_input = self.tokenizer(
            texts, max_length=512, padding=True, truncation=True, return_tensors="pt"
        )
        model_output = self.model(**encoded_input)
        # CLS pooling: take the hidden state of the first token
        sentence_embeddings = model_output[0][:, 0]
        # L2-normalize so cosine similarity reduces to a dot product
        sentence_embeddings = F.normalize(sentence_embeddings)
        return sentence_embeddings


model_id = "BAAI/bge-small-en-v1.5"
hf = HF(model_id=model_id)
hf.embed(documents).shape
torch.Size([12, 384])

Setting up ⚡️FastEmbed

There isn't much to set up here. We'll use the same Flag Embedding model, BAAI/bge-small-en-v1.5, that we loaded with Huggingface above.

embedding_model = TextEmbedding(model_name=model_id)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

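As a quick sanity check, here is a minimal sketch (assuming numpy, which fastembed already depends on). Note that embed returns a generator of numpy arrays, so we materialize it to inspect the shape; it should match the (12, 384) shape we got from the Huggingface model:

import numpy as np

# embed() yields one numpy array per document; collect them into a
# single (num_documents, embedding_dim) array
fst_embeddings = np.array(list(embedding_model.embed(documents)))
fst_embeddings.shape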

📊 Comparison

We'll compare the mean, maximum, and minimum embedding times across k runs. Let's write a function to do that:

🚀 Calculating Stats

import types


def calculate_time_stats(
    embed_func: Callable, documents: list, k: int
) -> Tuple[float, float, float]:
    times = []
    for _ in range(k):
        # Timing the embed_func call
        start_time = time.time()
        embeddings = embed_func(documents)
        # Force computation if embed_func returns a generator
        if isinstance(embeddings, types.GeneratorType):
            list(embeddings)

        end_time = time.time()
        times.append(end_time - start_time)

    # Returning mean, max, and min time for the call
    return (sum(times) / k, max(times), min(times))


hf_stats = calculate_time_stats(hf.embed, documents, k=100)
print(f"Huggingface Transformers (Average, Max, Min): {hf_stats}")
fst_stats = calculate_time_stats(embedding_model.embed, documents, k=100)
print(f"FastEmbed (Average, Max, Min): {fst_stats}")
Huggingface Transformers (Average, Max, Min): (0.04711266994476318, 0.0658111572265625, 0.043084144592285156)
FastEmbed (Average, Max, Min): (0.04384247303009033, 0.05654191970825195, 0.04293417930603027)
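To put those tuples in perspective, here is a quick back-of-the-envelope calculation of the relative speed from the mean times (index 0 of each stats tuple):

# Ratio of mean embedding times; > 1 means FastEmbed is faster
speedup = hf_stats[0] / fst_stats[0]
print(f"FastEmbed is {speedup:.2f}x the speed of Huggingface Transformers on this machine")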

📈 Results

Let's plot the results and see how the two libraries compare.

def plot_character_per_second_comparison(
    hf_stats: Tuple[float, float, float], fst_stats: Tuple[float, float, float], documents: list
):
    # Calculating total characters in documents
    total_characters = sum(len(doc) for doc in documents)

    # Calculating characters per second for each model
    hf_chars_per_sec = total_characters / hf_stats[0]  # Mean time is at index 0
    fst_chars_per_sec = total_characters / fst_stats[0]

    # Plotting the bar chart
    models = ["HF Embed (Torch)", "FastEmbed"]
    chars_per_sec = [hf_chars_per_sec, fst_chars_per_sec]

    bars = plt.bar(models, chars_per_sec, color=["#1f356c", "#dd1f4b"])
    plt.ylabel("Characters per Second")
    plt.title("Characters Processed per Second Comparison")

    # Adding the number at the top of each bar
    for bar, chars in zip(bars, chars_per_sec):
        plt.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height(),
            f"{chars:.1f}",
            ha="center",
            va="bottom",
            color="#1f356c",
            fontsize=12,
        )

    plt.show()


plot_character_per_second_comparison(hf_stats, fst_stats, documents)
[Bar chart: characters processed per second, HF Embed (Torch) vs FastEmbed]

Are the Embeddings the same?

This is an important sanity check: a speedup means little if the embeddings differ. Let's compare them.

def calculate_cosine_similarity(embeddings1: Tensor, embeddings2: Tensor) -> float:
    """
    Calculate cosine similarity between two sets of embeddings
    """
    return F.cosine_similarity(embeddings1, embeddings2).mean().item()


calculate_cosine_similarity(hf.embed(documents), Tensor(list(embedding_model.embed(documents))))
/var/folders/b4/grpbcmrd36gc7q5_11whbn540000gn/T/ipykernel_14307/1958479940.py:8: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_new.cpp:278.)
  calculate_cosine_similarity(hf.embed(documents), Tensor(list(embedding_model.embed(documents))))

0.9999992847442627
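As the warning suggests, stacking the generator output into a single numpy array before building the tensor avoids the slow list-of-arrays conversion. A minor cleanup with the same result:

import numpy as np

# Convert FastEmbed's generator output to one contiguous numpy array
# before building the tensor, silencing the slow-conversion warning
fst_tensor = Tensor(np.array(list(embedding_model.embed(documents))))
calculate_cosine_similarity(hf.embed(documents), fst_tensor)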

This indicates the embeddings are effectively identical, with a mean cosine similarity above 0.999 for BAAI/bge-small-en-v1.5. This gives us confidence that we are not sacrificing accuracy for speed.