Binary Quantization with Qdrant & OpenAI Embedding

In large-scale data retrieval and processing, efficiency is crucial. As data volumes grow, the ability to retrieve information quickly and accurately can significantly affect system performance. This blog post explores a technique known as binary quantization applied to OpenAI embeddings, demonstrating how it can speed up retrieval by 20x or more.

What Are OpenAI Embeddings?

OpenAI embeddings are numerical representations of textual information. They transform text into a vector space where semantically similar texts are mapped close together. This mathematical representation enables computers to understand and process human language more effectively.
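To make "mapped close together" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three 4-dimensional vectors and their names are hypothetical stand-ins invented for illustration; real OpenAI embeddings in this post have 1536 dimensions and come from the API, not from hand-written numbers.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy "embeddings", made up for illustration only.
cat = np.array([0.9, 0.1, 0.0, 0.3])
kitten = np.array([0.8, 0.2, 0.1, 0.4])
invoice = np.array([-0.1, 0.9, 0.7, -0.2])

sim_close = cosine_similarity(cat, kitten)   # related texts: high similarity
sim_far = cosine_similarity(cat, invoice)    # unrelated texts: low similarity
print(sim_close, sim_far)
```

A vector search engine ranks stored vectors by a score of this kind (this notebook uses the dot product) and returns the top matches.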

Binary Quantization

Binary quantization is a method that converts continuous numerical values into binary values (0 or 1). It simplifies the data structure, allowing faster computations. Here's a brief overview of the binary quantization process applied to OpenAI embeddings:

Load Embeddings: OpenAI embeddings are loaded from Parquet files.
Binary Transformation: The continuous-valued vectors are converted into binary form. Values greater than 0 are set to 1, and all other values are set to 0.
Comparison & Retrieval: Binary vectors are compared using logical XOR operations and other efficient algorithms.
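The transformation and comparison steps can be sketched in a few lines of NumPy. This is an illustrative sketch, not Qdrant's internal implementation: the helper names `binarize` and `hamming_distance` are hypothetical, and the vectors are random stand-ins for real embeddings.

```python
import numpy as np


def binarize(vectors):
    """Set components greater than 0 to 1, all others to 0, and pack 8 bits per byte."""
    bits = (np.asarray(vectors) > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)


def hamming_distance(packed_a, packed_b):
    """Count differing bits: XOR the packed bytes, then count the set bits."""
    return int(np.unpackbits(np.bitwise_xor(packed_a, packed_b)).sum())


rng = np.random.default_rng(37)
original = rng.standard_normal(1536)                     # stand-in for an embedding
near = original + 0.05 * rng.standard_normal(1536)       # slightly perturbed copy
unrelated = rng.standard_normal(1536)                    # independent vector

packed = binarize(np.stack([original, near, unrelated]))
d_near = hamming_distance(packed[0], packed[1])
d_far = hamming_distance(packed[0], packed[2])
print(packed.shape, d_near, d_far)
```

Each 1536-dimensional vector packs into 192 bytes, and the perturbed copy differs from the original in far fewer bits than the unrelated vector does, which is what makes the binary form usable as a cheap first-pass filter.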

Binary quantization is a promising approach to improve retrieval speed and reduce the memory footprint of vector search engines. In this notebook we will show how to use Qdrant to perform binary quantization of vectors and run fast similarity search on the resulting index.
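A quick back-of-the-envelope calculation shows where the memory saving comes from. It assumes the original vectors are stored as 32-bit floats (4 bytes per component) and the quantized vectors use 1 bit per component, and it uses the 100K vectors of 1536 dimensions from the dataset in this notebook. Index structures and payloads are ignored, so treat the numbers as a rough estimate of vector storage only.

```python
n_vectors = 100_000
n_dim = 1536

float32_bytes = n_vectors * n_dim * 4   # 4 bytes per component (assumed float32)
binary_bytes = n_vectors * n_dim // 8   # 1 bit per component, 8 bits per byte
ratio = float32_bytes / binary_bytes

print(f"float32 vectors: {float32_bytes / 2**20:.1f} MiB")
print(f"binary vectors:  {binary_bytes / 2**20:.1f} MiB")
print(f"reduction:       {ratio:.0f}x")
```

Under these assumptions the quantized vectors take 1/32 of the space, which is why it is practical to keep them in RAM while the original vectors stay on disk.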

1. Imports

!pip install qdrant-client pandas datasets --quiet --upgrade

import os
import random
import time

import datasets
import numpy as np
import pandas as pd
from qdrant_client import QdrantClient, models

random.seed(37)
np.random.seed(37)

2. Download and Slice Dataset

We will use the dbpedia-entities dataset, loaded with the Hugging Face Datasets library. It contains 100K vectors of 1536 dimensions each.

dataset = datasets.load_dataset(
    "Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K", split="train"
)
len(dataset)

n_dim = len(dataset["text-embedding-3-small-1536-embedding"][0])
n_dim

client = QdrantClient(  # assumes Qdrant is launched at localhost:6333
    prefer_grpc=True,
)

collection_name = "binary-quantization"

client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=n_dim,
        distance=models.Distance.DOT,
        on_disk=True,
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)

True

def iter_dataset(dataset):
    for point in dataset:
        # Use the same embedding column that n_dim was computed from above
        yield point["text-embedding-3-small-1536-embedding"], {"text": point["text"]}


vectors, payload = zip(*iter_dataset(dataset))
client.upload_collection(
    collection_name=collection_name,
    vectors=vectors,
    payload=payload,
    parallel=max(1, (os.cpu_count() // 2)),
)

collection_info = client.get_collection(collection_name=collection_name)
collection_info.dict()

{'status': <CollectionStatus.GREEN: 'green'>,
 'optimizer_status': <OptimizersStatusOneOf.OK: 'ok'>,
 'vectors_count': None,
 'indexed_vectors_count': 97760,
 'points_count': 100000,
 'segments_count': 7,
 'config': {'params': {'vectors': {'size': 1536,
    'distance': <Distance.DOT: 'Dot'>,
    'hnsw_config': None,
    'quantization_config': None,
    'on_disk': True,
    'datatype': None},
   'shard_number': 1,
   'sharding_method': None,
   'replication_factor': 1,
   'write_consistency_factor': 1,
   'read_fan_out_factor': None,
   'on_disk_payload': True,
   'sparse_vectors': None},
  'hnsw_config': {'m': 16,
   'ef_construct': 100,
   'full_scan_threshold': 10000,
   'max_indexing_threads': 0,
   'on_disk': False,
   'payload_m': None},
  'optimizer_config': {'deleted_threshold': 0.2,
   'vacuum_min_vector_number': 1000,
   'default_segment_number': 0,
   'max_segment_size': None,
   'memmap_threshold': None,
   'indexing_threshold': 20000,
   'flush_interval_sec': 5,
   'max_optimization_threads': None},
  'wal_config': {'wal_capacity_mb': 32, 'wal_segments_ahead': 0},
  'quantization_config': {'binary': {'always_ram': True}}},
 'payload_schema': {}}

Oversampling vs Recall

Preparing a query dataset

For the purpose of this illustration, we'll take a few vectors that we know are already in the index, add a little Gaussian noise to each, and use them as queries. We should get the original points back as results from the Qdrant index.

query_indices = random.sample(range(len(dataset)), 100)
query_dataset = dataset[query_indices]
query_indices

# Add Gaussian noise to any vector


def add_noise(vector, noise=0.05):
    return vector + noise * np.random.randn(*vector.shape)

def correct(results, text):
    return text in [x.payload["text"] for x in results]


def count_correct(query_dataset, limit=1, oversampling=1, rescore=False):
    correct_results = 0
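    # Use the same embedding column that n_dim was computed from above
    for query_vector, text in zip(
        query_dataset["text-embedding-3-small-1536-embedding"], query_dataset["text"]
    ):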
        results = client.search(
            collection_name=collection_name,
            query_vector=add_noise(np.array(query_vector)),
            limit=limit,
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    rescore=rescore,
                    oversampling=oversampling,
                )
            ),
        )
        correct_results += correct(results, text)
    return correct_results

limit_grid = [1, 3, 10, 20, 50]
oversampling_grid = [1.0, 3.0, 5.0]
rescore_grid = [True, False]
results = []

for limit in limit_grid:
    for oversampling in oversampling_grid:
        for rescore in rescore_grid:
            start = time.perf_counter()
            correct_results = count_correct(
                query_dataset, limit=limit, oversampling=oversampling, rescore=rescore
            )
            end = time.perf_counter()
            results.append(
                {
                    "limit": limit,
                    "oversampling": oversampling,
                    "bq_candidates": int(oversampling * limit),
                    "rescore": rescore,
                    # Divide by the actual number of queries instead of a hard-coded 100
                    "accuracy": correct_results / len(query_dataset["text"]),
                    "total queries": len(query_dataset["text"]),
                    "time": end - start,
                }
            )

df = pd.DataFrame(results)

df[["limit", "oversampling", "rescore", "accuracy", "bq_candidates", "time"]]
# df.to_csv("candidates-rescore-time.csv", index=False)

	limit	oversampling	rescore	accuracy	bq_candidates	time
0	1	1.0	True	0.95	1	0.300152
1	1	1.0	False	0.85	1	0.244668
2	1	3.0	True	0.95	3	0.124406
3	1	3.0	False	0.83	3	0.171471
4	1	5.0	True	0.98	5	0.118219
5	1	5.0	False	0.87	5	0.111914
6	3	1.0	True	0.95	3	0.121328
7	3	1.0	False	0.92	3	0.267725
8	3	3.0	True	0.96	9	0.416834
9	3	3.0	False	0.90	9	0.410730
10	3	5.0	True	0.97	15	0.231671
11	3	5.0	False	0.93	15	0.252269
12	10	1.0	True	0.96	10	0.133462
13	10	1.0	False	0.92	10	0.285158
14	10	3.0	True	0.95	30	0.320695
15	10	3.0	False	0.98	30	0.457904
16	10	5.0	True	0.96	50	0.453204
17	10	5.0	False	0.94	50	0.450944
18	20	1.0	True	0.97	20	0.361066
19	20	1.0	False	0.95	20	0.585992
20	20	3.0	True	0.96	60	0.550389
21	20	3.0	False	0.96	60	0.618630
22	20	5.0	True	1.00	100	0.458241
23	20	5.0	False	0.95	100	0.441106
24	50	1.0	True	0.98	50	0.603967
25	50	1.0	False	0.96	50	0.514531
26	50	3.0	True	1.00	150	0.548153
27	50	3.0	False	0.98	150	0.608930
28	50	5.0	True	1.00	250	0.487522
29	50	5.0	False	0.99	250	0.313810
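One way to read the table is to put the rescored and non-rescored accuracy side by side for each oversampling value. The sketch below does this for the limit=1 rows only, with the values copied from rows 0 to 5 above. The `mode` column and the `by_mode` name are introduced here for illustration; in the notebook the same pivot could be applied to `df` directly.

```python
import numpy as np
import pandas as pd

# Rows 0-5 of the results table above (limit=1): (oversampling, rescore, accuracy)
rows = [
    (1.0, True, 0.95),
    (1.0, False, 0.85),
    (3.0, True, 0.95),
    (3.0, False, 0.83),
    (5.0, True, 0.98),
    (5.0, False, 0.87),
]
subset = pd.DataFrame(rows, columns=["oversampling", "rescore", "accuracy"])

# Use string labels so the pivoted columns are easy to address by name.
subset["mode"] = np.where(subset["rescore"], "rescored", "binary_only")
by_mode = subset.pivot(index="oversampling", columns="mode", values="accuracy")
by_mode["gain"] = by_mode["rescored"] - by_mode["binary_only"]
print(by_mode)
```

For these rows rescoring improves accuracy at every oversampling value. The timings in the table are noisy for such a small query set, so they are left out of this comparison.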

Why are results for oversampling=1.0 and limit=1 better with rescore=True than with rescore=False?

It might seem that with oversampling=1.0 and limit=1 Qdrant retrieves only 1 point, so it should not matter whether we rescore it or not: the same point would come back, just with a different score (computed from the original vectors).

But in fact, there are 2 reasons why the results are different: 1) HNSW is an approximate algorithm, and it might return different results for the same query. 2) Qdrant stores points in segments. When we query for 1 point, Qdrant looks for the best match in each segment, and then chooses the best among them. In this example the collection was split into several segments (the collection info above reports a segments_count of 7), so Qdrant found one candidate per segment using binary scores, replaced their scores with scores from the original vectors, and selected the best one, which led to better accuracy.
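The interaction between oversampling and rescoring can be illustrated with a brute-force sketch in NumPy. It is a simplified model, not how Qdrant works internally: there is no HNSW index and no segments, the data is random, and the function name `search` and the `target` index are hypothetical. It only shows the two-stage idea: pick limit * oversampling candidates by binary distance, then optionally re-rank them with the original vectors.

```python
import numpy as np

rng = np.random.default_rng(37)
n_points, n_dim = 2000, 256
points = rng.standard_normal((n_points, n_dim))    # stand-in for stored vectors
target = 123                                       # hypothetical point we expect back
query = points[target] + 0.05 * rng.standard_normal(n_dim)


def search(query, points, limit, oversampling=1.0, rescore=False):
    """Two-stage search: cheap binary pre-selection, optional exact re-ranking."""
    # Stage 1: rank every point by Hamming distance between sign bits.
    hamming = np.count_nonzero((points > 0) != (query > 0), axis=1)
    n_candidates = int(limit * oversampling)
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]
    # Stage 2: re-rank the candidates by exact dot product on the original vectors.
    if rescore:
        exact = points[candidates] @ query
        candidates = candidates[np.argsort(-exact, kind="stable")]
    return candidates[:limit]


top_binary = search(query, points, limit=1)
top_rescored = search(query, points, limit=1, oversampling=3.0, rescore=True)
print(top_binary, top_rescored)
```

With more candidates in the first stage, the true nearest neighbor is more likely to survive the coarse binary ranking, and rescoring then restores the exact ordering among those candidates.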
