Binary Quantization with Qdrant & OpenAI Embedding


In large-scale data retrieval and processing, efficiency is crucial. As data volumes grow exponentially, the ability to retrieve information quickly and accurately can significantly affect system performance. This blog post explores binary quantization applied to OpenAI embeddings and demonstrates how it can speed up retrieval by 20x or more.

What Are OpenAI Embeddings?

OpenAI embeddings are numerical representations of textual information. They transform text into a vector space where semantically similar texts are mapped close together. This mathematical representation enables computers to understand and process human language more effectively.
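
As a rough illustration of "semantically similar texts are mapped close together", the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and their labels are hand-made stand-ins chosen for illustration, not real OpenAI embeddings, and `cosine_similarity` is a helper defined here.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy "embeddings"; real OpenAI embeddings have 1536 dimensions here.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9])

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))
```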

Binary Quantization

Binary quantization is a method that converts continuous numerical values into binary values (0 or 1). This shrinks the data and allows much faster computations. Here's a brief overview of the binary quantization process applied to OpenAI embeddings:

  1. Load Embeddings: OpenAI embeddings are loaded from parquet files.
  2. Binary Transformation: The continuous-valued vectors are converted into binary form. Values greater than 0 are set to 1, and all other values are set to 0.
  3. Comparison & Retrieval: Binary vectors are compared using logical XOR followed by a bit count (the Hamming distance), which is far cheaper than floating-point arithmetic.
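
The steps above can be sketched in plain NumPy. This is a minimal illustration of the idea, not Qdrant's internal implementation: the helper names `binarize` and `hamming_distance` are made up for this example, and random vectors stand in for embeddings loaded from parquet files.

```python
import numpy as np


def binarize(vectors: np.ndarray) -> np.ndarray:
    """Set components > 0 to 1 and all others to 0, then pack 8 bits per byte."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)


def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary vectors (XOR + bit count)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


rng = np.random.default_rng(37)
# Stand-in for real embeddings: 4 random vectors of 1536 dimensions.
embeddings = rng.standard_normal((4, 1536)).astype(np.float32)

codes = binarize(embeddings)
print(codes.shape)  # 1536 bits packed into 192 bytes per vector
print(hamming_distance(codes[0], codes[1]))
```

A smaller Hamming distance means the two binary codes agree on more sign bits, which serves as a cheap proxy for similarity between the original vectors.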

Binary Quantization is a promising approach to improve retrieval speeds and reduce memory footprint of vector search engines. In this notebook we will show how to use Qdrant to perform binary quantization of vectors and perform fast similarity search on the resulting index.
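
To make the memory claim concrete, here is a back-of-the-envelope calculation. It assumes the original vectors are stored as float32 (4 bytes per component) and uses the dataset size from this post (100,000 vectors of 1536 dimensions); index and payload overhead are ignored.

```python
n_vectors = 100_000
n_dim = 1536

# Assumption: original vectors are float32, i.e. 4 bytes per component.
float32_bytes = n_vectors * n_dim * 4
# Binary quantization keeps 1 bit per component, i.e. 8 components per byte.
binary_bytes = n_vectors * n_dim // 8

print(f"float32: {float32_bytes / 1e6:.1f} MB")
print(f"binary:  {binary_bytes / 1e6:.1f} MB")
print(f"ratio:   {float32_bytes // binary_bytes}x")
```

Under these assumptions the binary codes take about 19.2 MB instead of 614.4 MB, a 32x reduction, which is what makes keeping them in RAM cheap.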

Table of Contents

  1. Imports
  2. Download and Slice Dataset
  3. Create Qdrant Collection
  4. Indexing
  5. Search

1. Imports

!pip install qdrant-client pandas datasets --quiet --upgrade
import os
import random
import time

import datasets
import numpy as np
import pandas as pd
from qdrant_client import QdrantClient, models

random.seed(37)
np.random.seed(37)

2. Download and Slice Dataset

We will be using the dbpedia-entities dataset, loaded with the Hugging Face Datasets library. It contains 100K vectors of 1536 dimensions each.

dataset = datasets.load_dataset(
    "Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K", split="train"
)
len(dataset)
100000
n_dim = len(dataset["text-embedding-3-small-1536-embedding"][0])
n_dim
1536

3. Create Qdrant Collection

client = QdrantClient(  # assumes Qdrant is launched at localhost:6333
    prefer_grpc=True,
)

collection_name = "binary-quantization"

client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=n_dim,
        distance=models.Distance.DOT,
        on_disk=True,
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)
True

4. Indexing

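def iter_dataset(dataset):
    # Yield (embedding, payload) pairs; the embedding column is the same one used to compute n_dim
    for point in dataset:
        yield point["text-embedding-3-small-1536-embedding"], {"text": point["text"]}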


vectors, payload = zip(*iter_dataset(dataset))
client.upload_collection(
    collection_name=collection_name,
    vectors=vectors,
    payload=payload,
    parallel=max(1, (os.cpu_count() // 2)),
)
collection_info = client.get_collection(collection_name=collection_name)
collection_info.dict()
{'status': <CollectionStatus.GREEN: 'green'>,
 'optimizer_status': <OptimizersStatusOneOf.OK: 'ok'>,
 'vectors_count': None,
 'indexed_vectors_count': 97760,
 'points_count': 100000,
 'segments_count': 7,
 'config': {'params': {'vectors': {'size': 1536,
    'distance': <Distance.DOT: 'Dot'>,
    'hnsw_config': None,
    'quantization_config': None,
    'on_disk': True,
    'datatype': None},
   'shard_number': 1,
   'sharding_method': None,
   'replication_factor': 1,
   'write_consistency_factor': 1,
   'read_fan_out_factor': None,
   'on_disk_payload': True,
   'sparse_vectors': None},
  'hnsw_config': {'m': 16,
   'ef_construct': 100,
   'full_scan_threshold': 10000,
   'max_indexing_threads': 0,
   'on_disk': False,
   'payload_m': None},
  'optimizer_config': {'deleted_threshold': 0.2,
   'vacuum_min_vector_number': 1000,
   'default_segment_number': 0,
   'max_segment_size': None,
   'memmap_threshold': None,
   'indexing_threshold': 20000,
   'flush_interval_sec': 5,
   'max_optimization_threads': None},
  'wal_config': {'wal_capacity_mb': 32, 'wal_segments_ahead': 0},
  'quantization_config': {'binary': {'always_ram': True}}},
 'payload_schema': {}}

5. Search

Oversampling vs Recall

Preparing a query dataset

For the purpose of this illustration, we'll take a few vectors which we know are already in the index and query them. We should get the same vectors back as results from the Qdrant index.

query_indices = random.sample(range(len(dataset)), 100)
query_dataset = dataset[query_indices]
query_indices
[89391,
 79659,
 12006,
 80978,
 87219,
 97885,
 83155,
 67504,
 4645,
 82711,
 48395,
 57375,
 69208,
 14136,
 89515,
 59880,
 78730,
 36952,
 49620,
 96486,
 55473,
 58179,
 18926,
 6489,
 11931,
 54146,
 9850,
 71259,
 37825,
 47331,
 84964,
 92399,
 56669,
 77042,
 73744,
 47993,
 83780,
 92429,
 75114,
 4463,
 69030,
 81185,
 27950,
 66217,
 54652,
 8260,
 1151,
 993,
 85954,
 66863,
 47303,
 8992,
 92688,
 76030,
 29472,
 3077,
 42454,
 46120,
 69140,
 20877,
 2844,
 95423,
 1770,
 28568,
 96448,
 94227,
 40837,
 91684,
 29785,
 66936,
 85121,
 39546,
 81910,
 5514,
 37068,
 35731,
 93990,
 26685,
 63076,
 18762,
 27922,
 34916,
 80976,
 83189,
 6328,
 57508,
 58860,
 13758,
 72976,
 85030,
 332,
 34963,
 85009,
 31344,
 11560,
 58108,
 85163,
 17064,
 44712,
 45962]
# Add Gaussian noise to a vector


def add_noise(vector, noise=0.05):
    return vector + noise * np.random.randn(*vector.shape)


def correct(results, text):
    # A query counts as correct if its source text appears among the returned payloads
    return text in [x.payload["text"] for x in results]


def count_correct(query_dataset, limit=1, oversampling=1, rescore=False):
    correct_results = 0
    for query_vector, text in zip(
        query_dataset["text-embedding-3-small-1536-embedding"], query_dataset["text"]
    ):
        results = client.search(
            collection_name=collection_name,
            query_vector=add_noise(np.array(query_vector)),
            limit=limit,
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    rescore=rescore,
                    oversampling=oversampling,
                )
            ),
        )
        correct_results += correct(results, text)
    return correct_results
limit_grid = [1, 3, 10, 20, 50]
oversampling_grid = [1.0, 3.0, 5.0]
rescore_grid = [True, False]
results = []

for limit in limit_grid:
    for oversampling in oversampling_grid:
        for rescore in rescore_grid:
            start = time.perf_counter()
            correct_results = count_correct(
                query_dataset, limit=limit, oversampling=oversampling, rescore=rescore
            )
            end = time.perf_counter()
            results.append(
                {
                    "limit": limit,
                    "oversampling": oversampling,
                    "bq_candidates": int(oversampling * limit),
                    "rescore": rescore,
                    "accuracy": correct_results / len(query_dataset["text"]),
                    "total queries": len(query_dataset["text"]),
                    "time": end - start,
                }
            )
df = pd.DataFrame(results)

df[["limit", "oversampling", "rescore", "accuracy", "bq_candidates", "time"]]
# df.to_csv("candidates-rescore-time.csv", index=False)
    limit  oversampling  rescore  accuracy  bq_candidates  time (s)
 0      1           1.0     True      0.95              1  0.300152
 1      1           1.0    False      0.85              1  0.244668
 2      1           3.0     True      0.95              3  0.124406
 3      1           3.0    False      0.83              3  0.171471
 4      1           5.0     True      0.98              5  0.118219
 5      1           5.0    False      0.87              5  0.111914
 6      3           1.0     True      0.95              3  0.121328
 7      3           1.0    False      0.92              3  0.267725
 8      3           3.0     True      0.96              9  0.416834
 9      3           3.0    False      0.90              9  0.410730
10      3           5.0     True      0.97             15  0.231671
11      3           5.0    False      0.93             15  0.252269
12     10           1.0     True      0.96             10  0.133462
13     10           1.0    False      0.92             10  0.285158
14     10           3.0     True      0.95             30  0.320695
15     10           3.0    False      0.98             30  0.457904
16     10           5.0     True      0.96             50  0.453204
17     10           5.0    False      0.94             50  0.450944
18     20           1.0     True      0.97             20  0.361066
19     20           1.0    False      0.95             20  0.585992
20     20           3.0     True      0.96             60  0.550389
21     20           3.0    False      0.96             60  0.618630
22     20           5.0     True      1.00            100  0.458241
23     20           5.0    False      0.95            100  0.441106
24     50           1.0     True      0.98             50  0.603967
25     50           1.0    False      0.96             50  0.514531
26     50           3.0     True      1.00            150  0.548153
27     50           3.0    False      0.98            150  0.608930
28     50           5.0     True      1.00            250  0.487522
29     50           5.0    False      0.99            250  0.313810

Why are results for oversampling=1.0 and limit=1 better with rescore=True than with rescore=False?

It might seem that with oversampling=1.0 and limit=1 Qdrant retrieves only 1 point, so rescoring should not matter: the same point would be returned, just with a different score (computed from the original vectors).

In fact, there are two reasons why the results differ:

  1. HNSW is an approximate algorithm, so it may return different results for the same query.
  2. Qdrant stores points in segments. When we query for 1 point, Qdrant looks for the best point in each segment and then chooses the best match among them.

In this example we had 8 segments: Qdrant found 8 points with binary scores, replaced their scores with scores computed from the original vectors, and selected the best of them, which led to better accuracy.
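
The segment effect can be imitated with a toy model. The sketch below is an illustration under stated assumptions, not Qdrant's implementation: it uses random unit vectors, brute-force Hamming search instead of HNSW, and a hypothetical `search_top1` helper that takes one candidate per segment and optionally re-ranks the candidates by the original dot product.

```python
import numpy as np

rng = np.random.default_rng(37)
n_points, n_dim, n_segments = 800, 64, 8  # toy sizes, chosen for illustration

vectors = rng.standard_normal((n_points, n_dim)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
bits = vectors > 0  # binary codes: 1 bit per component
segments = np.array_split(np.arange(n_points), n_segments)


def search_top1(query: np.ndarray, rescore: bool) -> int:
    """Return the id of the best point, taking one candidate per segment."""
    q_bits = query > 0
    candidates = []
    for ids in segments:
        # Hamming distance between the query code and every code in this segment
        hamming = (bits[ids] != q_bits).sum(axis=1)
        candidates.append((int(ids[np.argmin(hamming)]), int(hamming.min())))
    if rescore:
        # Re-rank the per-segment candidates with the original vectors
        return max(candidates, key=lambda c: float(vectors[c[0]] @ query))[0]
    # Without rescoring, the binary score alone decides
    return min(candidates, key=lambda c: c[1])[0]


noisy_query = vectors[5] + 0.05 * rng.standard_normal(n_dim)
print(search_top1(noisy_query, rescore=True), search_top1(noisy_query, rescore=False))
```

With rescoring, the final choice among the per-segment candidates is made with full-precision scores, so a candidate that narrowly lost on binary score can still win.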