Binary Quantization with Qdrant & OpenAI Embedding
In large-scale data retrieval and processing, efficiency is crucial. As data volumes grow exponentially, the ability to retrieve information quickly and accurately can significantly affect system performance. This blog post explores a technique known as binary quantization applied to OpenAI embeddings, demonstrating how it can reduce retrieval latency by 20x or more.
What Are OpenAI Embeddings?
OpenAI embeddings are numerical representations of textual information. They transform text into a vector space where semantically similar texts are mapped close together. This mathematical representation enables computers to understand and process human language more effectively.
Binary Quantization
Binary quantization is a method that converts continuous numerical values into binary values (0 or 1). It simplifies the data structure, allowing faster computations. Here's a brief overview of the binary quantization process applied to OpenAI embeddings:
- Load Embeddings: OpenAI embeddings are loaded from parquet files.
- Binary Transformation: The continuous-valued vectors are converted into binary form. Values greater than 0 are set to 1, and all other values are set to 0.
- Comparison & Retrieval: Binary vectors are compared using logical XOR operations and other efficient bitwise algorithms.
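The steps above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not Qdrant's internal implementation: the helper names `binarize` and `hamming_distance` and the tiny 8-dimensional vectors are hypothetical stand-ins for real 1536-dimensional embeddings.

```python
import numpy as np


def binarize(vectors: np.ndarray) -> np.ndarray:
    """Map each component to 1 if it is greater than 0, otherwise to 0."""
    return (vectors > 0).astype(np.uint8)


def hamming_distance(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """Count the positions where two binary vectors differ (XOR, then count)."""
    return int(np.count_nonzero(np.bitwise_xor(a_bits, b_bits)))


# Hypothetical toy vectors (assumption: real embeddings have 1536 dimensions).
query = np.array([0.12, -0.40, 0.05, 0.33, -0.07, 0.21, -0.18, 0.02])
doc_a = np.array([0.10, -0.35, 0.01, 0.30, -0.02, 0.25, -0.20, 0.04])  # similar signs
doc_b = np.array([-0.22, 0.31, -0.09, 0.15, 0.11, -0.27, 0.19, -0.03])  # mostly flipped

q_bits, a_bits, b_bits = binarize(query), binarize(doc_a), binarize(doc_b)
dist_a = hamming_distance(q_bits, a_bits)
dist_b = hamming_distance(q_bits, b_bits)
print(q_bits, dist_a, dist_b)
```

A smaller Hamming distance means more sign agreement between the two vectors, so `doc_a` ranks ahead of `doc_b` for this query.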
Binary quantization is a promising approach to improving retrieval speed and reducing the memory footprint of vector search engines. In this notebook we will show how to use Qdrant to binary-quantize vectors and run fast similarity search on the resulting index.
Table of Contents
- Imports
- Download and Slice Dataset
- Create Qdrant Collection
- Indexing
- Search
1. Imports
!pip install qdrant-client pandas datasets --quiet --upgrade
import os
import random
import time
import datasets
import numpy as np
import pandas as pd
from qdrant_client import QdrantClient, models
random.seed(37)
np.random.seed(37)
2. Download and Slice Dataset
We will be using the dbpedia-entities dataset, loaded with the Hugging Face Datasets library. It contains 100K vectors of 1536 dimensions each.
dataset = datasets.load_dataset(
    "Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K", split="train"
)
len(dataset)
n_dim = len(dataset[0]["text-embedding-3-small-1536-embedding"])  # read one row instead of the whole column
n_dim
3. Create Qdrant Collection
client = QdrantClient(  # assumes Qdrant is launched at localhost:6333
    prefer_grpc=True,
)
collection_name = "binary-quantization"
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=n_dim,
        distance=models.Distance.DOT,
        on_disk=True,
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)
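The configuration above keeps the original vectors on disk (`on_disk=True`) and the binary-quantized vectors in RAM (`always_ram=True`). A rough back-of-the-envelope estimate shows why that is affordable. The sketch below assumes the original vectors are stored as 4-byte float32 values and ignores index structures and payloads, so treat the numbers as approximate.

```python
n_vectors = 100_000  # dataset size stated in this notebook
n_dim = 1536         # dimensions per vector stated in this notebook

# Assumption: each original component is a 4-byte float32.
float32_bytes = n_vectors * n_dim * 4
# Binary quantization keeps 1 bit per component (8 components per byte).
binary_bytes = n_vectors * n_dim // 8
ratio = float32_bytes // binary_bytes

print(f"float32 vectors: {float32_bytes / 1e6:.1f} MB")
print(f"binary vectors:  {binary_bytes / 1e6:.1f} MB")
print(f"reduction: {ratio}x")
```

Under these assumptions the quantized vectors take roughly 1/32 of the space of the originals, which is what makes holding them entirely in memory practical.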
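4. Indexing
def iter_dataset(dataset):
    # Yield (vector, payload) pairs; the embedding column is the same one used to compute n_dim
    for point in dataset:
        yield point["text-embedding-3-small-1536-embedding"], {"text": point["text"]}
vectors, payload = zip(*iter_dataset(dataset))
client.upload_collection(
    collection_name=collection_name,
    vectors=vectors,
    payload=payload,
    # os.cpu_count() can return None, so fall back to 2 before halving
    parallel=max(1, (os.cpu_count() or 2) // 2),
)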
collection_info = client.get_collection(collection_name=collection_name)
collection_info.dict()
Oversampling vs Recall
Preparing a query dataset
For the purpose of this illustration, we'll take a few vectors which we know are already in the index and query them. We should get the same vectors back as results from the Qdrant index.
query_indices = random.sample(range(len(dataset)), 100)
query_dataset = dataset[query_indices]
query_indices
# Add Gaussian noise to a vector
def add_noise(vector, noise=0.05):
    return vector + noise * np.random.randn(*vector.shape)
# True if the query's own text appears among the returned payloads
def correct(results, text):
    return text in [x.payload["text"] for x in results]
# Count how many noisy queries retrieve their source text within the top `limit` results
def count_correct(query_dataset, limit=1, oversampling=1, rescore=False):
    correct_results = 0
    for query_vector, text in zip(
        query_dataset["text-embedding-3-small-1536-embedding"], query_dataset["text"]
    ):
        results = client.search(
            collection_name=collection_name,
            query_vector=add_noise(np.array(query_vector)),
            limit=limit,
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    rescore=rescore,
                    oversampling=oversampling,
                )
            ),
        )
        correct_results += correct(results, text)
    return correct_results
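Before running the grid, it can help to see how strong the perturbation applied by `add_noise` is. The sketch below is a standalone estimate, not part of the benchmark: it uses a random unit-length vector as a hypothetical stand-in for an embedding (assuming the stored embeddings are normalized to length 1) and the same noise scale of 0.05.

```python
import numpy as np

rng = np.random.default_rng(37)

# Hypothetical stand-in for an embedding: a random unit-length 1536-dimensional vector.
vector = rng.standard_normal(1536)
vector /= np.linalg.norm(vector)

noise_scale = 0.05  # same default as add_noise in this notebook
noisy = vector + noise_scale * rng.standard_normal(vector.shape)

# Compare the size of the noise with the size of the (unit-length) signal.
noise_norm = float(np.linalg.norm(noisy - vector))
cosine = float(noisy @ vector / (np.linalg.norm(noisy) * np.linalg.norm(vector)))
print(f"noise norm: {noise_norm:.2f}, cosine similarity to original: {cosine:.2f}")
```

Because the noise is added independently to each of the 1536 components, its total norm is larger than the unit-length vector itself, so these queries are a fairly demanding test rather than near-exact lookups.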
limit_grid = [1, 3, 10, 20, 50]
oversampling_grid = [1.0, 3.0, 5.0]
rescore_grid = [True, False]
n_queries = len(query_dataset["text"])  # use the actual query count instead of a hard-coded 100
results = []
for limit in limit_grid:
    for oversampling in oversampling_grid:
        for rescore in rescore_grid:
            start = time.perf_counter()
            correct_results = count_correct(
                query_dataset, limit=limit, oversampling=oversampling, rescore=rescore
            )
            end = time.perf_counter()
            results.append(
                {
                    "limit": limit,
                    "oversampling": oversampling,
                    "bq_candidates": int(oversampling * limit),
                    "rescore": rescore,
                    "accuracy": correct_results / n_queries,
                    "total queries": n_queries,
                    "time": end - start,
                }
            )
df = pd.DataFrame(results)
df[["limit", "oversampling", "rescore", "accuracy", "bq_candidates", "time"]]
# df.to_csv("candidates-rescore-time.csv", index=False)
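To make the `bq_candidates` column concrete, here is a toy NumPy model of oversampling and rescoring. It is a simplified sketch under stated assumptions: it does a brute-force scan instead of HNSW, the function name `search_with_oversampling` is hypothetical, and the random data is not the dbpedia set. Qdrant's real implementation differs in many details.

```python
import numpy as np


def search_with_oversampling(query, vectors, limit=1, oversampling=1.0, rescore=True):
    """Toy binary-quantized search: rank by Hamming distance, oversample, optionally rescore."""
    # Hamming distance between sign patterns (boolean != acts as XOR).
    hamming = np.count_nonzero((vectors > 0) != (query > 0), axis=1)

    # Fetch int(oversampling * limit) candidates using the cheap binary score.
    n_candidates = max(limit, int(oversampling * limit))
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]

    if rescore:
        # Re-rank the candidates by dot product on the original vectors.
        exact = vectors[candidates] @ query
        candidates = candidates[np.argsort(-exact, kind="stable")]
    return candidates[:limit]


rng = np.random.default_rng(37)
vectors = rng.standard_normal((1000, 64))  # hypothetical data
query = vectors[0] + 0.5 * rng.standard_normal(64)

top_plain = search_with_oversampling(query, vectors, limit=3, oversampling=1.0, rescore=False)
top_rescored = search_with_oversampling(query, vectors, limit=3, oversampling=5.0, rescore=True)
print(top_plain, top_rescored)
```

In this model, a larger `oversampling` widens the candidate pool drawn from the binary index, and `rescore=True` lets the original vectors decide the final order, which is the trade-off the grid above measures.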
Why are results for oversampling=1.0 and limit=1 better with rescore=True than with rescore=False?
It might seem that with oversampling=1.0 and limit=1 Qdrant retrieves only one point, so rescoring should not matter: the same point would come back, only with a different score (computed from the original vectors).
In fact, there are two reasons why the results differ: 1) HNSW is an approximate algorithm, and it might return different results for the same query. 2) Qdrant stores points in segments. When we query for one point, Qdrant looks for that point in each segment and then chooses the best match. In this example we had 8 segments, so Qdrant found 8 points using binary scores, replaced their scores with scores from the original vectors, and selected the best of them, which led to better accuracy.
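The segment effect can be illustrated with a small simulation. This is a toy model only, assuming each segment proposes its single best candidate by binary score and the proposals are then merged; the random data and variable names are hypothetical, and real segment handling in Qdrant is more involved.

```python
import numpy as np

rng = np.random.default_rng(37)
n_segments = 8  # number of segments mentioned in this example
vectors = rng.standard_normal((800, 64))  # hypothetical data
query = vectors[5] + 0.5 * rng.standard_normal(64)

# Split point ids into segments and precompute both kinds of score.
segments = np.array_split(np.arange(len(vectors)), n_segments)
hamming = np.count_nonzero((vectors > 0) != (query > 0), axis=1)  # binary score
exact = vectors @ query                                           # original-vector score

# With limit=1, each segment proposes its own best candidate by binary score.
per_segment_best = np.array([seg[np.argmin(hamming[seg])] for seg in segments])

# rescore=False: merge the proposals using the binary scores.
winner_no_rescore = per_segment_best[np.argmin(hamming[per_segment_best])]
# rescore=True: replace binary scores with original-vector scores, then merge.
winner_rescore = per_segment_best[np.argmax(exact[per_segment_best])]

print(winner_no_rescore, winner_rescore)
```

Even though only one result is requested, rescoring chooses among as many candidates as there are segments, so its pick is never worse by the original score than the pick made on binary scores alone.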