vekterdb.vekterdb

VekterDB turns any SQLAlchemy-compliant database into a vector database using the FAISS library to index the vectors.

Copyright (C) 2023 Matthew Hendrey

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class vekterdb.vekterdb.VekterDB(table_name: str, idx_name: str = 'idx', vector_name: str = 'vector', columns_dict: Dict[str, Dict] = {}, url: str = 'sqlite:///:memory:', connect_args: Dict = {}, faiss_index: str | None = None)[source]

Bases: object

Turn any SQLAlchemy-compliant database into a vector database using the FAISS library to index the vectors.

VekterDB uses a minimum of two columns in the database table: idx_name (BigInteger, default ‘idx’) and vector_name (LargeBinary, default ‘vector’). Vectors are numpy arrays of np.float32, serialized to bytes using tobytes(), to comply with FAISS requirements.
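
As a minimal sketch of the byte format described above, a float32 vector can be round-tripped with NumPy alone. This illustrates only the tobytes() step and is not VekterDB's own code; the variable names are made up for the example.

```python
import numpy as np

# A 4-dimensional example vector; FAISS works with float32 data.
vector = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)

# Serialize: each float32 component occupies 4 bytes.
vector_bytes = vector.tobytes()

# Deserialize: frombuffer returns a read-only view, so copy it
# to obtain an independent, writable array.
restored = np.frombuffer(vector_bytes, dtype=np.float32).copy()
```

The dtype is not stored in the bytes, so the reader has to know that the data is float32 in order to recover the vector.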

Additional database table columns can be specified with columns_dict using SQLAlchemy’s Column arguments. For example, include a unique, indexed id field (str) and a non-unique product_category field (str) with:

my_db = VekterDB(
    "my_table",
    columns_dict={
        "id": {"type": Text, "unique": True, "nullable": False, "index": True},
        "product_category": {"type": Text},
    }
)

idx_name

Column name in the database table that stores the integer primary key, with values in [0, N).

Type:

str

vector_name

Column name in the database table that stores the vectors.

Type:

str

Session

Session factory that returns a Session object connected to the database table.

Type:

sa.orm.session.sessionmaker

Record

ORM-mapped class for the database table

Type:

sa.orm.decl_api.DeclarativeMeta

d

Dimensionality of the vectors

Type:

int

index

FAISS index that indexes the vector_name vectors for similarity search

Type:

faiss.Index

metric

Specifies the metric used to determine similarity.

Type:

str

Parameters:
  • table_name (str) – Database table name, either existing or new.

  • idx_name (str, optional) – Column name that stores the FAISS index integer ID and is the primary key for the database table. It must be unique and consecutive from [0, n_records). The default name is “idx”.

  • vector_name (str, optional) – Column name that stores the vector information. Default is “vector”.

  • columns_dict (Dict[str, Dict], optional) –

    Names (key) of additional columns to include in the table. The values are arguments that will be passed to SQLAlchemy’s Column. Default is {}.

    When connecting to an existing database table, this argument is not necessary.

  • url (str, optional) – URL string to connect to the database. Passed to SQLAlchemy’s create_engine. Default is “sqlite:///:memory:”, an in-memory database.

  • connect_args (Dict, optional) – Any connection arguments to pass to SQLAlchemy’s create_engine. Default is {}.

  • faiss_index (str, optional) – If given, then load an existing FAISS index saved by that name. Default is None.

attach(index, faiss_index: str)[source]

Attach a FAISS index to this VekterDB

Parameters:
  • index (faiss.Index) – FAISS index

  • faiss_index (str) – Name of the file where FAISS index resides on disk

Raises:

ValueError – If index.metric_type is neither L2 nor INNER_PRODUCT

create_index(faiss_index: str, faiss_factory_str: str, metric: str = 'inner_product', sample_size: int = 0, batch_size: int = 50000, faiss_runtime_params: str | None = None)[source]

Create a FAISS index. Train, if needed, using sample_size vectors. Add vectors from database table to index. Save to disk when completed.

Parameters:
  • faiss_index (str) – Name of the file to save the resulting FAISS index to.

  • faiss_factory_str (str) – FAISS index factory string, passed to faiss.index_factory()

  • metric (str, optional) – Metric used by FAISS to determine similarity. Valid values are either inner_product or L2. Default is inner_product

  • sample_size (int, optional) – Number of training vectors. If 0, uses all vectors. Default is 0.

  • batch_size (int, optional) – Passed to sample_vectors() to pull vectors in batches of this size. Default is 50,000.

  • faiss_runtime_params (str, optional) – Set FAISS index runtime parameters before adding vectors. Useful if using a quantizer index (e.g., IVF50000_HNSW32) since index.add() searches against the quantizer index. E.g., “quantizer_efSearch=40”. Afterwards, self.faiss_runtime_parameters is restored to the value it had before the call. Default is None.

Raises:
  • FileExistsError – If a FAISS index is already assigned to this table.

  • FileExistsError – If the file faiss_index already exists on disk.

  • TypeError – If the metric is neither “inner_product” nor “L2”
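
The set-then-restore behaviour described for faiss_runtime_params can be pictured with a small context manager. This is only a sketch of the pattern: RuntimeParamsHolder and temporary_runtime_params are hypothetical names, and the stand-in class merely records the parameter string rather than touching a FAISS index.

```python
from contextlib import contextmanager
from typing import Optional


class RuntimeParamsHolder:
    """Hypothetical stand-in that only tracks the runtime parameter string."""

    def __init__(self, params: str = ""):
        self.faiss_runtime_parameters = params

    def set_faiss_runtime_parameters(self, params: str) -> None:
        self.faiss_runtime_parameters = params


@contextmanager
def temporary_runtime_params(db, params: Optional[str]):
    """Apply params for the duration of the block, then restore the old value."""
    previous = db.faiss_runtime_parameters
    if params is not None:
        db.set_faiss_runtime_parameters(params)
    try:
        yield db
    finally:
        # Restore even if adding vectors raised an exception.
        db.set_faiss_runtime_parameters(previous)


holder = RuntimeParamsHolder("nprobe=10")
with temporary_runtime_params(holder, "quantizer_efSearch=40"):
    during = holder.faiss_runtime_parameters
after = holder.faiss_runtime_parameters
```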

static deserialize_vector(vector_bytes: bytes) ndarray[source]

Static method to deserialize Python bytes to 1-d numpy vector.

Parameters:

vector_bytes (bytes) – Bytes representation of a vector.

Returns:

1-d numpy array of type np.float32

Return type:

np.ndarray

insert(records: Iterable[Dict], batch_size: int = 10000, serialize_vectors: bool = True, faiss_runtime_params: str | None = None, compression_level: int = 1) int[source]

Insert multiple records into the table. Vectors will also be added to the FAISS index if it already exists. If the FAISS index is updated, it is saved to disk.

Parameters:
  • records (Iterable[Dict]) – Each dictionary contains the column names as keys and their corresponding values.

  • batch_size (int, optional) – Number of records to insert at once. Default is 10,000.

  • serialize_vectors (bool, optional) – If True, vectors will be serialized and compressed before insertion; if False, the user has already serialized and compressed the vectors. See serialize_vector() for details. Default is True.

  • faiss_runtime_params (str, optional) – Set FAISS index runtime parameters before adding vectors. Likely only useful with a quantizer index (e.g., IVF12345_HNSW32), where the quantizer (HNSW32) is searched during index.add() to determine which partition each vector is added to. You may want to override the default, whether that is the FAISS default (efSearch=16) or the value saved in self.faiss_runtime_parameters. For the example index, “quantizer_efSearch=40” would be a valid value. If used, self.faiss_runtime_parameters is restored afterwards to the value it had before the call. Default is None.

  • compression_level (int, optional) – Zstandard compression level to apply to vectors before inserting into the database. Default is 1 (fastest)

Returns:

n_records – Number of records added to the table

Return type:

int
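
To illustrate what inserting records batch_size at a time can look like, here is a minimal sketch that groups any iterable of record dictionaries into lists of at most batch_size. The helper batched_records is hypothetical and is not part of VekterDB; the actual insert logic may differ.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched_records(records: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator works as well as a list, so records never need to fit in memory at once.
records = ({"idx": i, "id": f"item-{i}"} for i in range(25))
batch_sizes = [len(batch) for batch in batched_records(records, batch_size=10)]
```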

static load(config_file: str, url: str, connect_args: Dict = {})[source]

Load a VekterDB from a configuration file (JSON format) and connect to the specified database engine.

Parameters:
  • config_file (str) – Name of the configuration file for the VekterDB to load.

  • url (str) – URL string to connect to the database. See sa.create_engine() for details.

  • connect_args (Dict, optional) – Connection arguments to pass to sa.create_engine(). Default is {}

Return type:

VekterDB

nearest_neighbors(where_clause: ClauseElement, k_nearest_neighbors: int, *col_names: str, k_extra_neighbors: int = 0, rerank: bool = True, threshold: float | None = None, search_parameters=None, batch_size: int = 10000) List[Dict][source]

Return nearest neighbors of the records in the database table that match the where_clause. Optionally keep only neighbors whose similarity exceeds the threshold.

Parameters:
  • where_clause (sa.sql.ClauseElement) – Where clause specifying records of interest. Passed to select()

  • k_nearest_neighbors (int) – Number of nearest neighbors to return.

  • *col_names (str) – Names of the columns to use in the query and neighbor records. If none are given, all columns are used.

  • k_extra_neighbors (int, optional) – Extra neighbors to return from FAISS index before reranking. If using a vector quantizer (e.g., PQ), FAISS orders results based upon the estimated similarities which likely differs from the true similarities calculated here. Default is 0.

  • rerank (bool, optional) – If True, rerank neighbors according to their true similarities. Otherwise the order is determined by the FAISS index’s index.search(). Default is True.

  • threshold (float, optional) – Only keep neighbors whose similarities exceed the threshold. Default is None which keeps all neighbors returned.

  • search_parameters (faiss.SearchParameters, optional) – Use these search parameters instead of the current runtime FAISS parameters. Passed to FAISS’s index.search(). See [FAISS documentation](https://github.com/facebookresearch/faiss/wiki/Setting-search-parameters-for-one-query)

  • batch_size (int, optional) – Batch size to use when retrieving neighbors information from the database table. Default is 10,000.

Returns:

For each query, a dictionary containing the query’s record and a list of its neighbors’ records. A neighbor record includes the “metric” similarity.

Return type:

List[Dict]

sample_vectors(sample_size: int = 0, batch_size: int = 10000) ndarray[source]

Retrieve a sample of vectors from the database table.

Parameters:
  • sample_size (int, optional) – Number of vectors to return. Default 0 returns all vectors

  • batch_size (int, optional) – Pull vectors in batches of this size. Default is 10,000.

Returns:

2-d array of sampled vectors with shape (sample_size, d)

Return type:

np.ndarray

save(config_file: str | None = None)[source]

Saves configuration info to a JSON file, excluding the URL string for security. If set_faiss_runtime_parameters() has been called, that setting is saved as well and is applied again when loading with VekterDB.load().

Parameters:

config_file (str, optional) – JSON file name to save to disk. If not provided, saves the file as table_name.json. The default is None.
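
The idea of keeping the database URL out of the saved file can be sketched with the standard json module. The configuration keys below are assumptions chosen for illustration; the fields VekterDB actually writes may be named differently.

```python
import json
import os
import tempfile

# Hypothetical configuration contents; note that no "url" entry is included.
config = {
    "table_name": "my_table",
    "idx_name": "idx",
    "vector_name": "vector",
    "faiss_index": "my_table.index",
    "faiss_runtime_parameters": "nprobe=50",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirrors the documented default file name of table_name.json.
    config_file = os.path.join(tmp_dir, f"{config['table_name']}.json")
    with open(config_file, "w") as f:
        json.dump(config, f, indent=2)

    with open(config_file) as f:
        loaded = json.load(f)
```

Because the URL is not stored, it has to be passed again when the configuration is loaded, which is why load() takes url as an argument.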

search(query_vectors: ndarray, k_nearest_neighbors: int, *col_names: str, k_extra_neighbors: int = 0, rerank: bool = True, threshold: float | None = None, search_parameters=None, batch_size: int = 10000) List[List[Dict]][source]

Search for the k_nearest_neighbors records in the database table based on the similarity of their vectors to the query vectors. Optionally keep only neighbors whose similarity exceeds the threshold.

Parameters:
  • query_vectors (np.ndarray) – The query vectors to search with. Shape is (n, d) and dtype is np.float32

  • k_nearest_neighbors (int) – Number of nearest neighbors to return.

  • *col_names (str) – Names of the columns to use in a neighbor record. If none are given, all columns are used.

  • k_extra_neighbors (int, optional) – Extra neighbors to return from FAISS index before reranking. If using a vector quantizer (e.g., PQ), FAISS orders results based upon the estimated similarities which likely differs from the true similarities calculated here. Default is 0.

  • rerank (bool, optional) – If True, rerank neighbors according to their true similarities. Otherwise the order is determined by the FAISS index’s index.search(). Default is True.

  • threshold (float, optional) – Only keep neighbors whose similarities exceed the threshold. Default is None which keeps all neighbors returned.

  • search_parameters (faiss.SearchParameters, optional) – Use these search parameters instead of the current runtime FAISS parameters. Passed to FAISS’s index.search(). See [FAISS documentation](https://github.com/facebookresearch/faiss/wiki/Setting-search-parameters-for-one-query)

  • batch_size (int, optional) – Batch size to use when retrieving neighbors information from the database table. Default is 10,000.

Returns:

For each query, a dictionary containing a “neighbors” key whose value is a list of the neighbors’ requested information, including the “metric” similarity to the query vector.

Return type:

List[Dict]
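
The interplay of k_nearest_neighbors, k_extra_neighbors, rerank and threshold can be sketched as follows for the inner product metric. The function rerank_sketch is hypothetical, and the order in which VekterDB applies the threshold and the truncation is an assumption here; candidates stands for the vectors of the k + k_extra neighbors that an approximate index returned.

```python
import numpy as np


def rerank_sketch(query, candidates, k, threshold=None):
    """Rerank candidate vectors by true inner product and keep the top k."""
    # True similarities, computed from the stored (unquantized) vectors.
    sims = candidates @ query
    # Sort candidate positions from most to least similar.
    order = np.argsort(-sims)
    results = [(int(i), float(sims[i])) for i in order]
    if threshold is not None:
        # Keep only neighbors whose similarity exceeds the threshold.
        results = [(i, s) for i, s in results if s > threshold]
    return results[:k]


query = np.array([1.0, 0.0], dtype=np.float32)
# Four candidates, i.e. k_nearest_neighbors=2 plus k_extra_neighbors=2.
candidates = np.array(
    [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.75, 0.25]], dtype=np.float32
)
top = rerank_sketch(query, candidates, k=2, threshold=0.6)
```

Requesting extra candidates gives the reranking step a chance to promote a true neighbor that the quantized estimate ranked just outside the top k.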

select(where_clause: ClauseElement, *ret_cols: str) Iterator[Dict][source]

Select records from the database table. Class member Record is the ORM-mapped class for the database table and should be used to construct the clause.

where = vekter_db.Record.idx == 0
where = vekter_db.Record.idx.in_([100, 200, 300])
where = sa.sql.and_(vekter_db.Record.idx>=0, vekter_db.Record.idx<5)

vekter_db.select(where)                   # Return all columns
vekter_db.select(where, "idx", "vector")  # Return idx & vector only

Parameters:
  • where_clause (sa.sql.ClauseElement) – SQLAlchemy Where Clause.

  • *ret_cols (str, optional) – Names of the columns to return from the database table. If none are given, all columns are returned.

Yields:

Iterator[Dict] – Dictionary of requested fields that match the where clause

static serialize_vector(vector: ndarray, level=1) bytes[source]

Static method to serialize numpy vector to Python bytes. Uses Zstandard to compress the bytes.

Parameters:
  • vector (np.ndarray) – 1-d numpy array of type np.float32

  • level (int) – Zstandard compression level [1, 22]. Default is 1 (fastest)

Return type:

bytes
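
A serialize-then-compress round trip might look like the sketch below. To stay free of extra dependencies it uses the standard library's zlib purely as a stand-in compressor, so its level range differs from the one documented above; serialize_sketch and deserialize_sketch are hypothetical helpers, not VekterDB's implementation.

```python
import zlib

import numpy as np


def serialize_sketch(vector: np.ndarray, level: int = 1) -> bytes:
    """Convert a 1-d vector to float32 bytes, then compress them."""
    raw = np.asarray(vector, dtype=np.float32).tobytes()
    return zlib.compress(raw, level)  # stand-in compressor (assumption)


def deserialize_sketch(vector_bytes: bytes) -> np.ndarray:
    """Decompress the bytes and rebuild the 1-d float32 vector."""
    raw = zlib.decompress(vector_bytes)
    return np.frombuffer(raw, dtype=np.float32).copy()


original = np.arange(8, dtype=np.float32)
blob = serialize_sketch(original)
round_tripped = deserialize_sketch(blob)
```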

set_faiss_runtime_parameters(runtime_params_str: str)[source]

Change FAISS runtime parameters with a human-readable string. Parameters are separated by commas. For example, with the index ‘OPQ64,IVF50000_HNSW32,PQ64’, you can use “nprobe=50,quantizer_efSearch=100” to set both the nprobe in the IVF index and the efSearch in the HNSW quantizer index. If a parameter is not recognized, an exception is thrown.

Saves the provided settings in self.faiss_runtime_parameters

Parameters:

runtime_params_str (str) – Comma-separated list of parameters to set. For more details, see https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning#parameterspace-as-a-way-to-set-parameters-on-an-opaque-index
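
The format of the parameter string can be made concrete with a small parser. This is only a sketch of the "name=value,name=value" layout; parse_runtime_params is a hypothetical helper, and in practice FAISS itself interprets the string and decides which parameter names are valid for a given index.

```python
from typing import Dict


def parse_runtime_params(runtime_params_str: str) -> Dict[str, float]:
    """Split 'name=value,name=value' into a dictionary of numeric settings."""
    params = {}
    for item in runtime_params_str.split(","):
        name, sep, value = item.strip().partition("=")
        if not sep or not name or not value:
            raise ValueError(f"Malformed runtime parameter: {item!r}")
        params[name] = float(value)
    return params


parsed = parse_runtime_params("nprobe=50,quantizer_efSearch=100")
```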

similarity(v1: ndarray, v2: ndarray, threshold: float | None = None) float[source]

Calculate the similarity between two vectors using the metric specified in create_index(). Currently only the inner product and L2 are supported. If the similarity fails to meet the threshold, None is returned.

Parameters:
  • v1 (np.ndarray) –

  • v2 (np.ndarray) –

  • threshold (float, optional) – Only return the value if similarity equals or exceeds this value. Default is None.

Returns:

Similarity of v1 and v2, or None if it falls below the threshold

Return type:

float
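
For the inner product metric, the threshold behaviour described above can be sketched as follows. The function name inner_product_similarity is hypothetical, and the L2 case is deliberately left out because how the threshold is applied to distances is not stated here.

```python
from typing import Optional

import numpy as np


def inner_product_similarity(
    v1: np.ndarray, v2: np.ndarray, threshold: Optional[float] = None
) -> Optional[float]:
    """Return the inner product of v1 and v2, or None if it is below threshold."""
    value = float(np.dot(v1, v2))
    if threshold is not None and value < threshold:
        return None
    return value


v1 = np.array([1.0, 2.0, 2.0], dtype=np.float32)
v2 = np.array([2.0, 0.0, 1.0], dtype=np.float32)

kept = inner_product_similarity(v1, v2, threshold=4.0)     # equals the threshold
dropped = inner_product_similarity(v1, v2, threshold=4.5)  # falls below it
```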

sync_index_to_db(batch_size: int = 100000, faiss_runtime_params: str | None = None) int[source]

Add any vectors from the database that are not in the FAISS index.

Parameters:
  • batch_size (int, optional) – Add vectors in batches of this size. Default is 100_000.

  • faiss_runtime_params (str, optional) – Set FAISS index runtime parameters before adding vectors. Useful if using a quantizer index (e.g., IVF50000_HNSW32) since index.add() searches against the quantizer index. E.g., “quantizer_efSearch=40”. Afterwards, self.faiss_runtime_parameters is restored to the value it had before the call. Default is None.

Returns:

Number of records added into FAISS index

Return type:

int

Raises:

IndexError – Primary key of the records to be added must have consecutive integer values
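
The consecutive-key requirement follows from how FAISS numbers vectors: when vectors are added without explicit ids, each one receives the next sequential id, starting at the number of vectors already in the index. A minimal sketch of such a check is shown below; check_consecutive is a hypothetical helper and the real validation inside sync_index_to_db() may differ.

```python
from typing import Sequence


def check_consecutive(new_ids: Sequence[int], n_indexed: int) -> None:
    """Raise IndexError unless new_ids are n_indexed, n_indexed + 1, ... in order."""
    expected = list(range(n_indexed, n_indexed + len(new_ids)))
    if list(new_ids) != expected:
        raise IndexError(
            f"Expected primary keys {expected[:3]}..., got {list(new_ids)[:3]}..."
        )


# Three vectors are already indexed, so the next keys must be 3, 4, 5.
check_consecutive([3, 4, 5], n_indexed=3)

# A gap in the keys would misalign database rows and FAISS ids.
try:
    check_consecutive([3, 5, 6], n_indexed=3)
    gap_detected = False
except IndexError:
    gap_detected = True
```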

train_index(sample_size: int = 0, batch_size: int = 50000, save_to_disk: bool = True)[source]

Pull vectors from database table to train the FAISS index.

Parameters:
  • sample_size (int, optional) – Number of training vectors. If 0, uses all vectors. Default is 0.

  • batch_size (int, optional) – Passed to sample_vectors() to pull vectors in batches of this size. Default is 50,000.

  • save_to_disk (bool, optional) – Save trained index to self.faiss_index on disk. Default is True.