
FastEmbed on GPU

As of version 0.2.7, FastEmbed supports GPU acceleration.

This notebook covers the installation process and usage of FastEmbed on GPU.

Installation

FastEmbed depends on onnxruntime and inherits its model of GPU support.

To run ONNX models on a GPU, you need the onnxruntime-gpu package, which replaces all of the onnxruntime functionality. FastEmbed mirrors this behavior and requires the fastembed-gpu package to be installed instead.

!pip install fastembed-gpu

NOTE: onnxruntime-gpu and onnxruntime cannot be installed in the same environment. If onnxruntime is already installed, uninstall it before installing onnxruntime-gpu. The same applies to fastembed and fastembed-gpu.
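Since only one build of each package can be present at a time, a small check like the following can flag a broken environment. This is a sketch: the helper name is made up here, and the package names are the ones published on PyPI.

```python
from importlib.metadata import distributions

# Pairs of mutually exclusive CPU/GPU builds (PyPI package names).
CONFLICTING_PAIRS = [("onnxruntime", "onnxruntime-gpu"), ("fastembed", "fastembed-gpu")]

def find_conflicts(installed):
    """Return the CPU/GPU package pairs that are both installed."""
    return [p for p in CONFLICTING_PAIRS if p[0] in installed and p[1] in installed]

# Collect the names of packages installed in the current environment.
installed = {d.metadata["Name"].lower() for d in distributions() if d.metadata["Name"]}
for cpu_pkg, gpu_pkg in find_conflicts(installed):
    print(f"Conflict: uninstall {cpu_pkg} before using {gpu_pkg}")
```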

CUDA 12.x support

By default, onnxruntime-gpu ships with CUDA 11.8 support. For CUDA 12.x support, install onnxruntime-gpu from a dedicated index URL:

!pip install onnxruntime-gpu -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ -qq
!pip install fastembed-gpu -qqq

You can check your CUDA version with commands such as nvidia-smi or nvcc --version.

Google Colab notebooks have CUDA 12.x.

CUDA drivers

FastEmbed does not include CUDA drivers or cuDNN libraries, so you need to set up the environment yourself. The dependencies required for your chosen onnxruntime version can be found here
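A quick way to verify the setup is to ask onnxruntime which execution providers it can actually use. In this sketch, CUDAExecutionProvider only appears when onnxruntime-gpu and its CUDA/cuDNN dependencies are installed correctly:

```python
# Environment check (sketch): list the execution providers onnxruntime
# was built with. CUDAExecutionProvider appears only if onnxruntime-gpu
# and its CUDA/cuDNN dependencies are set up correctly.
try:
    import onnxruntime as ort
    providers = ort.get_available_providers()
except ImportError:
    providers = []  # onnxruntime is not installed in this environment
print("Available providers:", providers)
print("CUDA available:", "CUDAExecutionProvider" in providers)
```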

Usage

from typing import List

import numpy as np

from fastembed import TextEmbedding

embedding_model_gpu = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5", providers=["CUDAExecutionProvider"]
)
embedding_model_gpu.model.model.get_providers()

['CUDAExecutionProvider', 'CPUExecutionProvider']
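onnxruntime also accepts providers as (name, options) tuples. Assuming FastEmbed forwards the providers list to onnxruntime's InferenceSession unchanged (a sketch, not verified against every FastEmbed version), a specific GPU can be pinned via device_id:

```python
# Sketch: pin the model to GPU 0, with CPU as a fallback.
# Assumption: FastEmbed passes this list straight to onnxruntime's
# InferenceSession, which accepts (provider_name, options) tuples.
providers = [
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",
]
# embedding_model = TextEmbedding(
#     model_name="BAAI/bge-small-en-v1.5", providers=providers
# )
```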
documents: List[str] = list(np.repeat("Demonstrating GPU acceleration in fastembed", 500))
%%timeit
list(embedding_model_gpu.embed(documents))
43.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

embedding_model_cpu = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embedding_model_cpu.model.model.get_providers()
['CPUExecutionProvider']
%%timeit
list(embedding_model_cpu.embed(documents))
4.33 s ± 591 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
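From the two timings above, the speedup works out to roughly two orders of magnitude. This sketch compares the mean times only and ignores the reported variance:

```python
# Mean times reported by %%timeit above.
gpu_ms = 43.4      # GPU: 43.4 ms per loop
cpu_ms = 4330.0    # CPU: 4.33 s per loop, expressed in milliseconds
speedup = cpu_ms / gpu_ms
print(f"GPU is roughly {speedup:.0f}x faster on this batch")  # roughly 100x
```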