
Binary Quantization with Qdrant & OpenAI Embedding


In the world of large-scale data retrieval and processing, efficiency is crucial. With the exponential growth of data, the ability to retrieve information quickly and accurately can significantly affect system performance. This blog post explores a technique known as binary quantization, applied to OpenAI embeddings, and demonstrates how it can speed up retrieval by 20x or more.

What Are OpenAI Embeddings?

OpenAI embeddings are numerical representations of textual information. They transform text into a vector space where semantically similar texts are mapped close together. This mathematical representation enables computers to understand and process human language more effectively.
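
As a minimal sketch (assuming the openai Python package, v1 or later, and an OPENAI_API_KEY environment variable; text-embedding-3-small is the model used to build the dataset we query below), creating an embedding looks like this:

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input="Binary quantization speeds up vector search.",
)
embedding = response.data[0].embedding  # a list of 1536 floats for this model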

Binary Quantization

Binary quantization is a method that converts continuous numerical values into binary values (0 or 1). This simplifies the data structure and allows faster computations. Here's a brief overview of the binary quantization process applied to OpenAI embeddings:

  1. Load Embeddings: OpenAI embeddings are loaded from parquet files.
  2. Binary Transformation: The continuous-valued vectors are converted into binary form: values greater than 0 are set to 1, and all others to 0.
  3. Comparison & Retrieval: Binary vectors are compared using bitwise XOR operations and other efficient algorithms instead of floating-point arithmetic (see the sketch below).
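
To make steps 2 and 3 concrete, here is an illustrative NumPy sketch with random vectors standing in for real embeddings (this is not the notebook's actual retrieval path, which Qdrant handles internally):

import numpy as np

# Random stand-ins for two OpenAI embeddings (real ones have 1536 dimensions)
a = np.random.randn(1536)
b = np.random.randn(1536)

# Step 2: values greater than 0 become 1, everything else becomes 0
a_bits = (a > 0).astype(np.uint8)
b_bits = (b > 0).astype(np.uint8)

# Step 3: XOR marks the positions where the bits differ; counting them gives
# the Hamming distance, a cheap proxy for vector (dis)similarity
hamming_distance = np.count_nonzero(np.bitwise_xor(a_bits, b_bits))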

Binary quantization is a promising approach for improving retrieval speed and reducing the memory footprint of vector search engines. In this notebook we show how to use Qdrant to perform binary quantization of vectors and run fast similarity search on the resulting index.

Table of Contents

  1. Imports
  2. Download and Slice Dataset
  3. Create Qdrant Collection
  4. Indexing
  5. Search

1. Imports

!pip install qdrant-client pandas datasets --quiet --upgrade
import os
import random
import time

import numpy as np
import pandas as pd
from qdrant_client import QdrantClient, models

random.seed(37)
np.random.seed(37)

2. Download and Slice Dataset

We will be using the dbpedia-entities dataset, loaded via the Hugging Face datasets library. It contains 100K OpenAI text-embedding-3-small vectors of 1536 dimensions each.

import datasets

dataset = datasets.load_dataset(
    "Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K", split="train"
)
len(dataset)
100000

3. Create Qdrant Collection

# Connects to a locally running Qdrant instance (gRPC on the default port)
client = QdrantClient(
    prefer_grpc=True,
)

collection_name = "binary-quantization"
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.DOT,
        on_disk=True,  # keep the original float32 vectors on disk
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,  # keep the compact binary vectors in RAM
        ),
    ),
)
True

4. Indexing

def iter_dataset(dataset):
    for point in dataset:
        yield point["openai"], {"text": point["text"]}


vectors, payload = zip(*iter_dataset(dataset))
client.upload_collection(
    collection_name=collection_name,
    vectors=vectors,
    payload=payload,
    parallel=max(1, (os.cpu_count() // 2)),
)
collection_info = client.get_collection(collection_name=collection_name)
collection_info.dict()
{'status': <CollectionStatus.YELLOW: 'yellow'>,
 'optimizer_status': <OptimizersStatusOneOf.OK: 'ok'>,
 'vectors_count': 116640,
 'indexed_vectors_count': 43520,
 'points_count': 116640,
 'segments_count': 6,
 'config': {'params': {'vectors': {'size': 1536,
    'distance': <Distance.DOT: 'Dot'>,
    'hnsw_config': None,
    'quantization_config': None,
    'on_disk': True},
   'shard_number': 1,
   'sharding_method': None,
   'replication_factor': 1,
   'write_consistency_factor': 1,
   'read_fan_out_factor': None,
   'on_disk_payload': True,
   'sparse_vectors': None},
  'hnsw_config': {'m': 16,
   'ef_construct': 100,
   'full_scan_threshold': 10000,
   'max_indexing_threads': 0,
   'on_disk': False,
   'payload_m': None},
  'optimizer_config': {'deleted_threshold': 0.2,
   'vacuum_min_vector_number': 1000,
   'default_segment_number': 0,
   'max_segment_size': None,
   'memmap_threshold': None,
   'indexing_threshold': 20000,
   'flush_interval_sec': 5,
   'max_optimization_threads': None},
  'wal_config': {'wal_capacity_mb': 32, 'wal_segments_ahead': 0},
  'quantization_config': {'binary': {'always_ram': True}}},
 'payload_schema': {}}

5. Search: Oversampling vs Recall

Preparing a query dataset

For the purpose of this illustration, we'll take a few vectors that we know are already in the index and query them; we should get those same vectors back as results from the Qdrant index. Binary quantization exposes two search-time knobs: oversampling, which retrieves oversampling × limit candidates using the fast binary index, and rescore, which re-ranks those candidates against the original full-precision vectors.

query_indices = random.sample(range(len(dataset)), 100)
query_dataset = dataset[query_indices]
query_indices
[89391, 79659, 12006, 80978, 87219, 97885, 83155, 67504, 4645, 82711,
 48395, 57375, 69208, 14136, 89515, 59880, 78730, 36952, 49620, 96486,
 55473, 58179, 18926, 6489, 11931, 54146, 9850, 71259, 37825, 47331,
 84964, 92399, 56669, 77042, 73744, 47993, 83780, 92429, 75114, 4463,
 69030, 81185, 27950, 66217, 54652, 8260, 1151, 993, 85954, 66863,
 47303, 8992, 92688, 76030, 29472, 3077, 42454, 46120, 69140, 20877,
 2844, 95423, 1770, 28568, 96448, 94227, 40837, 91684, 29785, 66936,
 85121, 39546, 81910, 5514, 37068, 35731, 93990, 26685, 63076, 18762,
 27922, 34916, 80976, 83189, 6328, 57508, 58860, 13758, 72976, 85030,
 332, 34963, 85009, 31344, 11560, 58108, 85163, 17064, 44712, 45962]
## Add Gaussian noise to any vector
## (perturbing the queries ensures they don't match the stored vectors exactly)


def add_noise(vector, noise=0.05):
    return vector + noise * np.random.randn(*vector.shape)
def correct(results, text):
    # True if the query's source text appears among the returned payloads
    return text in [x.payload["text"] for x in results]


def count_correct(query_dataset, limit=1, oversampling=1, rescore=False):
    # Count queries whose source text is found among the top `limit` results
    correct_results = 0
    for query_vector, text in zip(query_dataset["openai"], query_dataset["text"]):
        results = client.search(
            collection_name=collection_name,
            query_vector=add_noise(np.array(query_vector)),
            limit=limit,
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    rescore=rescore,
                    oversampling=oversampling,
                )
            ),
        )
        correct_results += correct(results, text)
    return correct_results
limit_grid = [1, 3, 10, 20, 50]
oversampling_grid = [1.0, 3.0, 5.0]
rescore_grid = [False, True]
results = []

for limit in limit_grid:
    for oversampling in oversampling_grid:
        for rescore in rescore_grid:
            start = time.perf_counter()
            correct_results = count_correct(
                query_dataset, limit=limit, oversampling=oversampling, rescore=rescore
            )
            end = time.perf_counter()
            results.append(
                {
                    "limit": limit,
                    "oversampling": oversampling,
                    "candidates": int(oversampling * limit),
                    "rescore": rescore,
                    "accuracy": correct_results / 100,
                    "total queries": len(query_dataset["text"]),
                    "time": end - start,
                }
            )
df = pd.DataFrame(results)
df[["candidates", "rescore", "accuracy", "time"]]
# df.to_csv("candidates-rescore-time.csv", index=False)
    candidates  rescore  accuracy  time (s)
0            1    False      0.90  0.221826
1            1     True      0.91  0.134167
2            3    False      0.88  0.115299
3            3     True      0.97  0.209320
4            5    False      0.84  0.154485
5            5     True      0.91  0.124424
6            3    False      0.99  0.121695
7            3     True      0.96  0.123257
8            9    False      0.94  0.119629
9            9     True      0.98  0.119372
10          15    False      0.90  0.121621
11          15     True      0.97  0.125466
12          10    False      0.93  0.135910
13          10     True      0.95  0.138135
14          30    False      0.94  0.177928
15          30     True      0.98  0.254588
16          50    False      0.94  0.268659
17          50     True      0.96  0.269792
18          20    False      0.96  0.249941
19          20     True      0.96  0.247138
20          60    False      0.97  0.251301
21          60     True      0.98  0.256504
22         100    False      0.98  0.270049
23         100     True      0.97  0.248972
24          50    False      0.97  0.306356
25          50     True      0.98  0.257544
26         150    False      0.98  0.238811
27         150     True      0.99  0.263939
28         250    False      0.99  0.256558
29         250     True      1.00  0.335823

With rescoring enabled, accuracy rises as the number of candidates grows, reaching 1.00 at 250 candidates, while the total time for the 100 queries stays within the same order of magnitude.