VekterDB
========

The Basics
----------

Transform any `SQLAlchemy `_ compliant database into a vector database by adding **any** type of `FAISS `_ index, giving you access to the many approximate nearest neighbor (ANN) algorithms available in FAISS. Most available vector databases only let users select from a limited subset of the ANN algorithms found in libraries like FAISS, typically Hierarchical Navigable Small-World (HNSW) or Inverted File System (IVF), but both of these struggle to scale effectively beyond 10-100 million vectors [#f1]_ [#f2]_ [#f3]_. The aim of VekterDB is to provide users with greater flexibility in both the ANN index and the type of database they use.

VekterDB requires a minimum of two columns in the database table. The first is an integer identifier in [0, N) that the FAISS index uses to refer to records and that serves as the primary key for the table. The second required column stores the vector as compressed bytes. Since VekterDB leverages FAISS, these vectors must be numpy arrays of type float32. Of course, additional columns may be specified. For example, you may have another column that serves as a more recognizable string ID field.
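The "compressed bytes" storage can be pictured with a small, self-contained sketch. This is an illustration only, not VekterDB's actual internal serialization: we assume a float32 numpy vector can round-trip through ``zlib``-compressed raw bytes.

```python
import zlib

import numpy as np

def vector_to_bytes(vec: np.ndarray) -> bytes:
    # Serialize a float32 vector to compressed bytes (illustrative only)
    assert vec.dtype == np.float32
    return zlib.compress(vec.tobytes())

def bytes_to_vector(blob: bytes, d: int) -> np.ndarray:
    # Inverse: decompress and reinterpret the buffer as float32
    return np.frombuffer(zlib.decompress(blob), dtype=np.float32, count=d)

rng = np.random.default_rng(0)
v = rng.normal(size=64).astype(np.float32)
blob = vector_to_bytes(v)
v2 = bytes_to_vector(blob, 64)
assert np.array_equal(v, v2)  # lossless round trip
```

The key point is only that the database column holds opaque bytes while FAISS works with the float32 arrays.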
Usage
-----

::

    import numpy as np
    from vekterdb import VekterDB

    # Let's use some random data
    def generate_data(N: int = 10_000, d: int = 64, seed: int = 1024):
        rng = np.random.default_rng(seed=seed)
        vectors = rng.normal(size=(N, d)).astype(np.float32)
        # Make vectors[1] be similar to vectors[0]
        vectors[1] = vectors[0] + rng.normal(scale=0.2, size=d).astype(np.float32)
        for idx, vector in enumerate(vectors):
            yield {"idx": idx, "vector": vector}

    records = generate_data()

    # Create new table "my_table" in an in-memory SQLite database
    vekter_db = VekterDB("my_table")

    # Insert vectors into the database
    n_records = vekter_db.insert(records)

    # Create a Flat FAISS index and save it in the file "my_table.index"
    vekter_db.create_index("my_table.index", faiss_factory_str="Flat")

    # Let's find the 3 nearest neighbors of idx=0 and idx=1
    # These should be each other
    results = vekter_db.nearest_neighbors("idx", [0, 1], 3, "idx")
    """
    Outputs the following
    [
        {'idx': 0,
         'neighbors': [
             {'idx': 1, 'metric': 73.78341},
             {'idx': 7537, 'metric': 39.142662},
             {'idx': 9831, 'metric': 31.672077},
         ],
        },
        {'idx': 1,
         'neighbors': [
             {'idx': 0, 'metric': 73.78341},
             {'idx': 7537, 'metric': 38.97898},
             {'idx': 9831, 'metric': 32.063244},
         ],
        },
    ]
    """

Tutorial
----------------------------------------------------------------------------------------

In this tutorial, we will work through an example using one of the datasets found in the `Approximate Nearest Neighbors Benchmark `_, SIFT-1M. The data is stored in an HDF5 file and can be downloaded from this `link `_ (~500MB).
The sift-128-euclidean.hdf5 file has the following keys:

* train : 1M 128-dimensional numpy vectors
* test : 10,000 128-dimensional numpy vectors
* neighbors : 10,000 x 100 listing the idx's of the top 100 neighbors of the corresponding test vector
* distances : 10,000 x 100 listing the Euclidean distance between each test vector and its neighbors

Set up
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We will begin by setting up a new conda environment and installing the needed packages. We need to install `FAISS `_ from conda, and FAISS currently supports up to Python 3.10 (see link if you want to use your GPU). We will also install h5py to be able to read the downloaded SIFT-1M dataset. ::

    $ conda create -n vekterdb_tutorial python=3.10
    $ conda activate vekterdb_tutorial
    $ conda install -c pytorch -c conda-forge -c defaults h5py faiss-cpu=1.7.4 mkl=2021 blas=1.0=mkl
    $ pip install vekterdb

The rest of the code assumes you are inside a Python interpreter (IPython, Jupyter, or whatever you like).

Initialize VekterDB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We begin by initializing a VekterDB. For this tutorial, we will use a SQLite database, but you could just as easily use Postgres, MySQL, or any other of the SQLAlchemy `dialects `_. The SQLite database will be stored in the local file "sift1m.db" in a table called "tutorial". Though the h5py file has just the two required fields for VekterDB, namely an integer identifier and the vector, we will add an additional string identifier for demonstration purposes. This column will be indexed by SQLite in order to allow a user to query for nearest neighbors using this identifier as well.
::

    import h5py
    import numpy as np
    import sqlalchemy as sa
    from vekterdb import VekterDB

    vekter_db = VekterDB(
        "tutorial",
        idx_name="idx",
        vector_name="vector",
        columns_dict={
            "id": {"type": sa.types.Text, "unique": True, "nullable": False, "index": True},
        },
        url="sqlite:///sift1m.db",
    )

Insert Records into the DB Table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With the database table, tutorial, now created in the database, sift1m.db, it is time to add the records from the HDF5 file. We will specify a generator function that yields the records to be inserted into the database table. Since we haven't specified a FAISS index yet, these records will only be added to the database table. The ``records_gen`` is a good candidate for parallelization. ::

    def records_gen(h5_file: str, test_data: bool = False):
        with h5py.File(h5_file, "r") as f:
            if test_data:
                vectors = f["test"]
            else:
                vectors = f["train"]
            for i, vector in enumerate(vectors):
                if test_data:
                    i += 1_000_000
                yield {"id": str(i), "idx": i, "vector": vector}

    train_records = records_gen("sift-128-euclidean.hdf5")
    n_records = vekter_db.insert(train_records)

Create FAISS Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With the database table now populated, we can construct the desired FAISS index, which handles the vector similarity queries. In this tutorial we will utilize a more complicated index than is probably necessary, but we want to demonstrate using an index appropriate for far more than 1M vectors. For scalability, we will use an "IVF_HNSW,PQ" index. Specifically, let's use an "IVF5000_HNSW32,PQ32" index. This splits the 128-dimensional space into 5 * sqrt(1,000,000) = 5,000 partitions. The centroids of the 5,000 partitions will themselves be indexed using an HNSW32. To help save space, we will also use product quantization to shrink each vector to ~32 bytes, down from 512 bytes.
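The sizing in the factory string is just arithmetic on the numbers quoted above (the 5 * sqrt(N) heuristic and the byte counts come from the paragraph, not from FAISS itself):

```python
import math

N, d = 1_000_000, 128

# Rule of thumb used above: number of IVF partitions ~ 5 * sqrt(N)
nlist = int(5 * math.sqrt(N))
assert nlist == 5_000  # hence "IVF5000"

# A raw float32 vector costs 4 bytes per dimension
raw_bytes = 4 * d  # 512 bytes
# PQ32 stores each vector as 32 one-byte codes
pq_bytes = 32
print(f"compression: {raw_bytes} -> {pq_bytes} bytes ({raw_bytes // pq_bytes}x)")
```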
The FAISS index will be saved to local disk in the "ivf_hnsw.index" file. The metric is set to "L2" to match the Euclidean distance used for the SIFT-1M dataset. We use all of the training data, 1M vectors, to train the index. We pull 50,000 records from the database at a time and also insert into FAISS in batches of this size. When adding vectors into the FAISS index, we will select the closest centroid from amongst a candidate pool of the nearest 25 centroids. If we had used just an "IVF5000,PQ32" index, we would compare each vector to all 5,000 centroids to determine which partition receives the vector. ::

    vekter_db.create_index(
        "ivf_hnsw.index",
        "IVF5000_HNSW32,PQ32",
        metric="L2",
        batch_size=50_000,
        faiss_runtime_params="quantizer_efSearch=25",
    )

Querying for Similar Vectors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

VekterDB offers two ways to query for nearest neighbors. The first handles the case where you have a vector that is **not** part of the database table, but you want to find the records in the table that are most similar to it. This is done with the ``search()`` method. The second is when you want to find the nearest neighbors of an existing record in the table. This is done with the ``nearest_neighbors()`` method.

So far, we have only added the training data to the database and used all those vectors to train the FAISS index. We begin by reading the testing data from the file and then using those vectors to query our vector database for similar records. We will compare the results to the "neighbors" stored in the HDF5 file.

Let's start slow and just use the first test vector to get its five nearest neighbors. To easily compare, we only need to see the "idx" column. The ``search()`` returns a list with one element for each of the query vectors.
Each element of the list is a dictionary with a single key, "neighbors", whose value is a list of dicts that are the neighbors for that query vector. Each dict's keys are the columns in the database table (or only those you specified with ``*col_names``) plus an additional ``"metric"`` key which holds the similarity between that neighbor and the query vector. ::

    test_data = list(records_gen("sift-128-euclidean.hdf5", test_data=True))
    f = h5py.File("sift-128-euclidean.hdf5", "r")
    true_neighbors = f["neighbors"]
    true_distances = f["distances"]

    neighbors = vekter_db.search(test_data[0]["vector"], 5, "idx")[0]["neighbors"]

    # Let's see just the first neighbor
    print(neighbors[0])
    # {'idx': 695756, 'metric': 258.86288}

    # True nearest neighbor is
    print(true_neighbors[0][0])  # 932085
    print(true_distances[0][0])  # 232.87122

You will likely notice that the true nearest neighbor, idx=932085 with distance=232.87, does NOT match the nearest record returned by our ``search()`` method. For me, I get idx=695756 with metric=258.86. But don't lose hope: the search we did used the default FAISS runtime search parameters. For the "IVF_HNSW,PQ" index, this means ``nprobe=1`` and ``quantizer_efSearch=16``. Searching only the single nearest partition, chosen from a candidate pool of 16, it is not surprising that our search failed to return the expected neighbor. Typically, setting ``nprobe`` to 2-5% of the number of partitions (5,000 for this index) yields acceptable results. We will use ``nprobe=175`` (3.5%). If ``nprobe=175``, then we should also increase ``quantizer_efSearch`` so that the candidate pool is bigger than ``nprobe``. We will use ``quantizer_efSearch=350`` and then retry our test query.
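The parameter choices above follow directly from the 2-5% rule of thumb; doubling ``quantizer_efSearch`` relative to ``nprobe`` is simply the convention used in this tutorial, not a FAISS requirement:

```python
nlist = 5_000  # IVF partitions in this index

# Rule of thumb: probe 2-5% of the partitions at query time
lo, hi = round(0.02 * nlist), round(0.05 * nlist)
print(lo, hi)  # 100 250

nprobe = round(0.035 * nlist)  # 3.5% -> 175
quantizer_efSearch = 2 * nprobe  # keep the HNSW candidate pool larger than nprobe
print(nprobe, quantizer_efSearch)  # 175 350
```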
::

    vekter_db.set_faiss_runtime_parameters("nprobe=175,quantizer_efSearch=350")
    neighbors = vekter_db.search(test_data[0]["vector"], 5, "idx")[0]["neighbors"]

    # Let's see just the first neighbor
    print(neighbors[0])
    # {'idx': 932085, 'metric': 232.87122}

    # True nearest neighbor is
    print(true_neighbors[0][0])  # 932085
    print(true_distances[0][0])  # 232.87122

Now that is more like it! If you still don't get the right answer, there are a few additional things you can try. If we had been more aggressive with the PQ, say as low as PQ8, then it's possible that the estimated distances FAISS uses to find the nearest records are not great. In this case, you can increase ``k_extra_neighbors`` from 0 (default) to, say, 30. This will return the 5 + 30 nearest vectors, rerank them by the true distance between the query vector and its neighbors, and return only the top 5 after reranking. If that doesn't work, you can experiment with increasing ``nprobe`` and/or ``quantizer_efSearch`` some more. But be warned! Increasing these values necessarily slows down querying since more of the data is being checked.

With a simple test done, let's query with all the test data and see how we do on the recall@1 metric. That is, do we find the true nearest neighbor in the top 1 nearest neighbor returned from the search? ::

    q_vecs = np.vstack([t["vector"] for t in test_data])
    search_results = vekter_db.search(q_vecs, 1, "idx")

    found_nearest = 0
    for i, search_result in enumerate(search_results):
        if true_neighbors[i][0] == search_result["neighbors"][0]["idx"]:
            found_nearest += 1
    print(f"Recall@1 = {found_nearest / len(search_results):.04f}")

In my run, I get recall@1 = 0.6566. This is ok, but maybe a little disheartening. However, we only allowed FAISS to return the single nearest approximate neighbor, and we have quite a few parameters to tweak if needed, as mentioned above.
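The reranking that ``k_extra_neighbors`` triggers can be pictured with a small numpy sketch. This is an illustration of the idea only, not VekterDB's internal code: fetch k + extra candidates by approximate distance, recompute the exact L2 distances, and keep only the top k.

```python
import numpy as np

def rerank(query: np.ndarray, cand_vectors: np.ndarray, cand_ids: list, k: int):
    # Recompute the exact L2 distance to every candidate and keep the k closest
    dists = np.linalg.norm(cand_vectors - query, axis=1)
    order = np.argsort(dists)[:k]
    return [(cand_ids[i], float(dists[i])) for i in order]

rng = np.random.default_rng(0)
query = rng.normal(size=8).astype(np.float32)
cands = rng.normal(size=(35, 8)).astype(np.float32)  # e.g. k=5 plus 30 extras

top5 = rerank(query, cands, list(range(35)), k=5)
assert len(top5) == 5
assert top5[0][1] <= top5[-1][1]  # sorted by true distance, nearest first
```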
We will start by raising the ``k_extra_neighbors`` value from 0 (default) to 4. ::

    search_results = vekter_db.search(q_vecs, 1, "idx", k_extra_neighbors=4)

    found_nearest = 0
    for i, search_result in enumerate(search_results):
        if true_neighbors[i][0] == search_result["neighbors"][0]["idx"]:
            found_nearest += 1
    print(f"Recall@1 = {found_nearest / len(search_results):.04f}")

Much better. I got 0.9469. In fact, if we increase further to ``k_extra_neighbors=49``, then our result goes up to 0.9904. Again, our query time is increasing, so it is always a trade-off. On my machine the query time for ``q_vecs`` went from 1.03s (``k_extra_neighbors=0``) to 1.37s (4) to 4.41s (49).

Querying for Similar Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this section, we will show how to query for the nearest neighbors of a given record that is already in the database table. We begin by adding the test vectors to our database table and FAISS index. We will then repeat our simple test, but this time using the ``nearest_neighbors()`` method. This method allows you to select your query records by specifying a SQLAlchemy where clause, ``where_clause``. The rest follows the same as ``search()``, since ``nearest_neighbors()`` simply combines a ``select()`` to retrieve the vectors of the requested records with a call to ``search()``. Care is taken to remove the query record itself from the neighbors list.

The ``nearest_neighbors()`` returns a list of dictionaries, each listing the query record's values: either all the column values or just the values of the column names specified with ``*col_names``. In addition, the "neighbors" key contains the list of neighbors and the metric between each neighbor and the query record. This list is sorted in the appropriate order, with the nearest neighbor listed first.
::

    n_records = vekter_db.insert(
        test_data, batch_size=50_000, faiss_runtime_params="quantizer_efSearch=25"
    )

    neighbors = vekter_db.nearest_neighbors(
        vekter_db.Record.idx == 1_000_000, 5, "idx"
    )[0]["neighbors"]

    if true_neighbors[0][0] == neighbors[0]["idx"]:
        print("We found the true nearest neighbor!")
        print(f"       Found {neighbors[0]}")
        ground_truth = {"idx": true_neighbors[0][0], "metric": true_distances[0][0]}
        print(f"Ground truth {ground_truth}")
    else:
        print(
            "Yikes! something still went wrong. Some things to try\n"
            + "Increase the k_extra_neighbors from 0 to say 20."
            + " This pulls back some additional records and then reranks by true L2."
            + "\nOr increase nprobe some more"
        )

Having passed the simple test, let's do the full test data. To mix things up, let's show that we can also use the ``id`` column to specify the records to fetch. Because we indexed this column in the specification of the database table, this should be equivalent to using the primary key, ``idx``, of the database table. ::

    nn_results = vekter_db.nearest_neighbors(
        vekter_db.Record.id.in_([str(i) for i in range(1_000_000, 1_010_000)]),
        1,
        "idx",
        k_extra_neighbors=4,
    )

    found_nearest = 0
    for i, nn_result in enumerate(nn_results):
        if true_neighbors[i][0] == nn_result["neighbors"][0]["idx"]:
            found_nearest += 1
    print(f"{found_nearest / len(nn_results):.04f}")

We get recall@1 = 0.9323 with these runtime search parameters and ``k_extra_neighbors=4``. Notice that this value is a little lower than before, 0.9469. This is caused by the additional 10,000 test vectors added into the database table, a small percentage of which are now the nearest neighbor for other records in the test data.
Saving & Loading a VekterDB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Having created a database table, populated it with records, and created a FAISS index (including both training it and adding in vectors from the database table), we are now ready to save the VekterDB so that it can be loaded back up. Since the records are already stored in the database, there is nothing to do there. In fact, the FAISS index has also been saved off already: first at the end of ``create_index()`` and then again after we called ``insert()`` to add the test data. This leaves just saving off some of the metadata needed to reinitialize ``VekterDB``, which is done with ``save()``. This writes a JSON file of the needed information to disk. If you don't provide a ``config_file`` name, ``save()`` will write the file {table_name}.json to the local directory. In particular, this saves off any default FAISS runtime search parameters that you may have set with ``set_faiss_runtime_parameters()``. ::

    vekter_db.save()
    del vekter_db

To test this, let's exit the Python interpreter/IPython/Jupyter notebook, then import the needed libraries, load our ``VekterDB`` from disk, and run a test query. Notice that you do need to provide the connection URL to the database when calling ``load()``. This is done for security reasons, since a username/password may need to be passed in that URL string and we don't want to save it to disk in plaintext. ::

    vekter_db = VekterDB.load("tutorial.json", url="sqlite:///sift1m.db")

    nn_results = vekter_db.nearest_neighbors(
        vekter_db.Record.id.in_([str(i) for i in range(1_000_000, 1_010_000)]),
        1,
        "idx",
        k_extra_neighbors=4,
    )

    found_nearest = 0
    for i, nn_result in enumerate(nn_results):
        if true_neighbors[i][0] == nn_result["neighbors"][0]["idx"]:
            found_nearest += 1
    print(f"{found_nearest / len(nn_results):.04f}")

And we get back the same 0.9323 for recall@1.
Using GPUs
----------------------------------------------------------------------------------------

VekterDB does not directly support GPU usage at this time. This mainly stems from differences across the FAISS index types in how to transfer some or all of an index. However, this section will show some examples of how you can leverage GPUs outside of VekterDB.

Training an IVF Index on a GPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since our tutorial worked with an "IVF5000_HNSW32,PQ32" index, we will show how to train this on a GPU. This is taken directly from a very helpful `FAISS gist `_. For this example, we assume that records have already been inserted into the database table. Instead of calling ``create_index()``, we will construct the index directly, place the needed part(s) onto the GPU, and then attach it to ``VekterDB``. After that we can use ``train_index()`` and ``sync_index_to_db()``. ::

    import faiss
    from vekterdb import VekterDB

    vekter_db = VekterDB("tutorial", url="sqlite:///sift1m.db")

Next we set up the FAISS index. For the "IVF_HNSW,PQ" index that we are using, the call to ``faiss.extract_index_ivf()`` isn't necessary since the index is already an IVF index. However, if we had added a preprocessing step (e.g., "OPQ32,IVF_HNSW,PQ32"), then this function call would extract the needed IVF index from the ``faiss.IndexPreTransform``. For an IVF, the main piece that needs to go on the GPU is the clustering step. The ``clustering_index`` is not needed after training, which is why we can just save the index (this happens automatically in ``sync_index_to_db()``) without having to transfer it back to the CPU before saving.
::

    index = faiss.index_factory(vekter_db.d, "IVF5000_HNSW32,PQ32", faiss.METRIC_L2)

    # Extract out the IVF index (only really needed if index has preprocessing step)
    index_ivf = faiss.extract_index_ivf(index)

    # Move the clustering index to the GPU (only thing that needs training)
    clustering_index = faiss.index_cpu_to_all_gpus(faiss.IndexFlatL2(vekter_db.d))

    # Assign the clustering index now on the GPU to be the one used by index
    index_ivf.clustering_index = clustering_index

Lastly, we attach the index to our ``VekterDB`` and proceed with the training. Afterwards, we add the vectors from the database table into the FAISS index, which then saves the FAISS index to disk with the filename given by ``faiss_index``. ::

    vekter_db.attach(index, "ivf_hnsw_gpu_trained.index")
    vekter_db.train_index(sample_size=0)  # Use all data to train
    vekter_db.sync_index_to_db(faiss_runtime_params="quantizer_efSearch=25")

    # Don't forget to set the faiss runtime parameters you like for searching
    vekter_db.set_faiss_runtime_parameters("nprobe=175,quantizer_efSearch=350")
    vekter_db.save("ivf_hnsw_gpu_trained.json")

.. rubric:: Footnotes

.. [#f1] I. Doshi, D. Das, A. Bhutani, R. Kumar, R. Bhatt, N. Balasubramanian, *LANNS: a web-scale approximate nearest neighbor lookup system*, Proceedings of the VLDB Endowment **15(4)**, 850 (2021). See also `arXiv:2010.09426 `_
.. [#f2] C. Fu, C. Xiang, C. Wang, and D. Cai, *Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph*, `arXiv:1707.00143 `_ (2017).
.. [#f3] B. Riggs and G. Williams, `ANN Benchmarks: A Data Scientist's Journey to Billion Scale Performance `_ (Note: they actually only tested on 54M vectors)