vekterdb.vekterdb
VekterDB turns any SQLAlchemy compliant database into a vector database using the FAISS library to index the vectors.
Copyright (C) 2023 Matthew Hendrey
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
- class vekterdb.vekterdb.VekterDB(table_name: str, idx_name: str = 'idx', vector_name: str = 'vector', columns_dict: Dict[str, Dict] = {}, url: str = 'sqlite:///:memory:', connect_args: Dict = {}, faiss_index: str | None = None)[source]
Bases: object
Turn any SQLAlchemy compliant database into a vector database using the FAISS library to index the vectors.
VekterDB uses a minimum of two columns in the database table: idx_name (BigInteger, default ‘idx’) and vector_name (LargeBinary, default ‘vector’). Vectors are numpy arrays of np.float32, serialized to bytes using tobytes(), to comply with FAISS requirements.
Additional database table columns can be specified with columns_dict using SQLAlchemy’s Column arguments. For example, include a unique, indexed id field (str) and a non-unique product_category field (str) with:

    my_db = VekterDB(
        "my_table",
        columns_dict={
            "id": {"type": Text, "unique": True, "nullable": False, "index": True},
            "product_category": {"type": Text},
        },
    )
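To make the minimal two-column layout concrete, here is a small sketch that uses Python's built-in sqlite3 and array modules instead of VekterDB, SQLAlchemy, or numpy. The table name `my_table` and the helper `to_float32_bytes` are hypothetical; the only things taken from the description above are an integer `idx` primary key and a binary `vector` column holding float32 bytes.

```python
import sqlite3
from array import array


def to_float32_bytes(values):
    """Pack a sequence of floats as raw 32-bit floats (hypothetical helper)."""
    return array("f", values).tobytes()


# In-memory database with the minimal layout: integer key plus binary vector.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (idx INTEGER PRIMARY KEY, vector BLOB NOT NULL)")

vectors = [[1.0, 0.0, 0.5], [0.25, 0.75, 0.0]]
conn.executemany(
    "INSERT INTO my_table (idx, vector) VALUES (?, ?)",
    [(i, to_float32_bytes(v)) for i, v in enumerate(vectors)],
)

# Read one row back and unpack the bytes into floats again.
(blob,) = conn.execute("SELECT vector FROM my_table WHERE idx = 1").fetchone()
restored = array("f")
restored.frombytes(blob)
restored_list = restored.tolist()
conn.close()
```

Each vector of dimension d occupies 4 * d bytes before any compression, which is why the column type is binary rather than text.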
- idx_name
Column name in the database table that stores the integer primary key, with values in [0, N).
- Type:
str
- vector_name
Column name in the database table that stores the vectors.
- Type:
str
- Session
Session factory that returns Session objects for the database table.
- Type:
sa.orm.session.sessionmaker
- Record
ORM-mapped class for the database table
- Type:
sa.orm.decl_api.DeclarativeMeta
- d
Dimensionality of the vectors
- Type:
int
- index
FAISS index that indexes the vector_name vectors for similarity search.
- Type:
faiss.Index
- metric
Specifies the metric used to determine similarity.
- Type:
str
- Parameters:
table_name (str) – Database table name, either existing or new.
idx_name (str, optional) – Column name that stores the FAISS index integer ID and is the primary key for the database table. It must be unique and consecutive from [0, n_records). The default name is “idx”.
vector_name (str, optional) – Column name that stores the vector information. Default is “vector”.
columns_dict (Dict[str, Dict], optional) – Names (keys) of additional columns to include in the table. The values are arguments that will be passed to SQLAlchemy’s Column. Default is {}. When connecting to an existing database table, this argument is not necessary.
url (str, optional) – URL string to connect to the database. Passed to SQLAlchemy’s create_engine. Default is “sqlite:///:memory:”, an in-memory database.
connect_args (Dict, optional) – Any connection arguments to pass to SQLAlchemy’s create_engine. Default is {}.
faiss_index (str, optional) – If given, then load an existing FAISS index saved by that name. Default is None.
- attach(index, faiss_index: str)[source]
Attach a FAISS index to this VekterDB
- Parameters:
index (faiss.Index) – FAISS index
faiss_index (str) – Name of the file where FAISS index resides on disk
- Raises:
ValueError – If index.metric_type is neither L2 nor INNER_PRODUCT.
- create_index(faiss_index: str, faiss_factory_str: str, metric: str = 'inner_product', sample_size: int = 0, batch_size: int = 50000, faiss_runtime_params: str | None = None)[source]
Create a FAISS index. Train, if needed, using sample_size vectors. Add vectors from database table to index. Save to disk when completed.
- Parameters:
faiss_index (str) – Name of the file to save the resulting FAISS index to.
faiss_factory_str (str) – FAISS index factory string, passed to faiss.index_factory().
metric (str, optional) – Metric used by FAISS to determine similarity. Valid values are either inner_product or L2. Default is inner_product.
sample_size (int, optional) – Number of training vectors. If 0, uses all vectors. Default is 0.
batch_size (int, optional) – Passed to sample_vectors() to pull vectors in batches of this size. Default is 50,000.
faiss_runtime_params (str, optional) – Set FAISS index runtime parameters before adding vectors. Useful if using a quantizer index (e.g., IVF50000_HNSW32) since index.add() searches against the quantizer index. E.g., “quantizer_efSearch=40”. self.faiss_runtime_parameters is set back to its value before function invocation. Default is None.
- Raises:
FileExistsError – If a FAISS index is already assigned to this table.
FileExistsError – If the file faiss_index already exists on disk.
TypeError – If the metric is neither “inner_product” nor “L2”.
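The faiss_runtime_params behavior described above (apply a setting while vectors are added, then put the previous value back) follows a set-and-restore pattern. The sketch below illustrates that pattern with a context manager. `FakeDB` and `temporary_runtime_params` are hypothetical stand-ins written for this example; they are not part of VekterDB and do not touch a real FAISS index.

```python
from contextlib import contextmanager


class FakeDB:
    """Hypothetical stand-in holding a runtime-parameter string."""

    def __init__(self, params=""):
        self.faiss_runtime_parameters = params

    def set_faiss_runtime_parameters(self, params):
        self.faiss_runtime_parameters = params


@contextmanager
def temporary_runtime_params(db, params):
    """Apply params for the duration of the block, then restore the old value."""
    previous = db.faiss_runtime_parameters
    if params is not None:
        db.set_faiss_runtime_parameters(params)
    try:
        yield db
    finally:
        # Restore even if adding vectors raised an exception.
        db.set_faiss_runtime_parameters(previous)


db = FakeDB("nprobe=50")
with temporary_runtime_params(db, "quantizer_efSearch=40"):
    during = db.faiss_runtime_parameters  # value in effect while adding
after = db.faiss_runtime_parameters  # original value is back
```

Restoring in a `finally` clause keeps the saved setting intact even when the add step fails partway through.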
- static deserialize_vector(vector_bytes: bytes) ndarray[source]
Static method to deserialize Python bytes to 1-d numpy vector.
- Parameters:
vector_bytes (bytes) – Bytes representation of a vector.
- Returns:
1-d numpy array of type np.float32
- Return type:
np.ndarray
- insert(records: Iterable[Dict], batch_size: int = 10000, serialize_vectors: bool = True, faiss_runtime_params: str | None = None, compression_level: int = 1) int[source]
Insert multiple records into the table. Vectors will also be added to the FAISS index if it already exists. If the FAISS index is updated, it is saved to disk.
- Parameters:
records (Iterable[Dict]) – Each dictionary contains the column names as keys and their corresponding values.
batch_size (int, optional) – Number of records to insert at once. Default is 10,000.
serialize_vectors (bool, optional) – If True, vectors will be serialized and compressed before insertion; if False, user has already serialized & compressed the vectors. See serialize_vector() for details. Default is True.
faiss_runtime_params (str, optional) – Set FAISS index runtime parameters before adding vectors. Likely only useful if you have a quantizer index (e.g., IVF12345_HNSW32). The quantizer index (HNSW32) will be used during the index.add() to determine which partition to add the vector to. You may want to change from the default, whether that is the FAISS default (efSearch=16) or the value saved in self.faiss_runtime_parameters. “quantizer_efSearch=40” would be an example value for the example index given. If used, self.faiss_runtime_parameters is set back to its value before function invocation. Default is None.
compression_level (int, optional) – Zstandard compression level to apply to vectors before inserting into the database. Default is 1 (fastest).
- Returns:
n_records – Number of records added to the table
- Return type:
int
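The batch_size parameter implies that records are consumed from the iterable in fixed-size chunks. A minimal sketch of that chunking follows, assuming nothing about how VekterDB does it internally; the `batched` helper and the sample records are hypothetical.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical records: an idx plus a tiny vector each, produced lazily.
records = ({"idx": i, "vector": [float(i), 0.0]} for i in range(7))
batch_sizes = [len(batch) for batch in batched(records, batch_size=3)]
```

Because the helper only pulls batch_size items at a time, a generator of records never has to be held in memory in full.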
- static load(config_file: str, url: str, connect_args: Dict = {})[source]
Load a VekterDB from a configuration file (JSON format) and connect to the specified database engine.
- Parameters:
config_file (str) – Name of the configuration file for the VekterDB to load.
url (str) – URL string to connect to the database. See sa.create_engine() for details.
connect_args (Dict, optional) – Connection arguments to pass to sa.create_engine(). Default is {}.
- Return type:
VekterDB
- nearest_neighbors(where_clause: ClauseElement, k_nearest_neighbors: int, *col_names: str, k_extra_neighbors: int = 0, rerank: bool = True, threshold: float | None = None, search_parameters=None, batch_size: int = 10000) List[Dict][source]
Return nearest neighbors of the records in the database table that match the where_clause. Optionally keep only neighbors whose similarity exceeds the threshold.
- Parameters:
where_clause (sa.sql.ClauseElement) – Where clause specifying records of interest. Passed to select().
k_nearest_neighbors (int) – Number of nearest neighbors to return.
*col_names (str) – List of columns to use in the query and neighbor records. Default of None uses all columns.
k_extra_neighbors (int, optional) – Extra neighbors to return from FAISS index before reranking. If using a vector quantizer (e.g., PQ), FAISS orders results based upon the estimated similarities which likely differs from the true similarities calculated here. Default is 0.
rerank (bool, optional) – If True, rerank neighbors according to their true similarities. Otherwise the order is determined by the FAISS index’s index.search(). Default is True.
threshold (float, optional) – Only keep neighbors whose similarities exceed the threshold. Default is None which keeps all neighbors returned.
search_parameters (faiss.SearchParameters, optional) – Use these search parameters instead of the current runtime FAISS parameters. Passed to FAISS’s index.search(). See [FAISS documentation](https://github.com/facebookresearch/faiss/wiki/Setting-search-parameters-for-one-query)
batch_size (int, optional) – Batch size to use when retrieving neighbors information from the database table. Default is 10,000.
- Returns:
For each query, a dictionary containing the query’s record and a list of its neighbors’ records. A neighbor record includes the “metric” similarity.
- Return type:
List[Dict]
- sample_vectors(sample_size: int = 0, batch_size: int = 10000) ndarray[source]
Retrieve a sample of vectors from the database table.
- Parameters:
sample_size (int, optional) – Number of vectors to return. Default 0 returns all vectors
batch_size (int, optional) – Pull vectors in batches of this size. Default is 10,000.
- Returns:
2-d array of sampled vectors with shape (sample_size, d)
- Return type:
np.ndarray
- save(config_file: str | None = None)[source]
Saves configuration info to a JSON file, excluding the URL string for security. If set_faiss_runtime_parameters() has been called, it also saves and applies that setting when loading with VekterDB.load().
- Parameters:
config_file (str, optional) – JSON file name to save to disk. If not provided, saves the file as table_name.json. The default is None.
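As an illustration of saving configuration without the connection URL, the sketch below writes a dictionary to JSON after dropping a `url` key. The `save_config` helper and every key name in `config` are assumptions made for this example; the actual contents of VekterDB's configuration file are not specified here.

```python
import json
import os
import tempfile


def save_config(config, table_name, config_file=None):
    """Write config as JSON, dropping the connection URL (hypothetical helper)."""
    if config_file is None:
        config_file = f"{table_name}.json"  # default name described above
    safe = {key: value for key, value in config.items() if key != "url"}
    with open(config_file, "w") as f:
        json.dump(safe, f, indent=2)
    return config_file


# Hypothetical configuration; the key names are illustrative only.
config = {
    "table_name": "my_table",
    "idx_name": "idx",
    "vector_name": "vector",
    "url": "postgresql://user:secret@host/db",
    "faiss_runtime_parameters": "nprobe=50",
}

with tempfile.TemporaryDirectory() as tmp:
    path = save_config(config, "my_table", os.path.join(tmp, "my_table.json"))
    with open(path) as f:
        loaded = json.load(f)
```

Leaving the URL out means credentials embedded in it never reach disk, which is why load() takes the URL as a separate argument.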
- search(query_vectors: ndarray, k_nearest_neighbors: int, *col_names: str, k_extra_neighbors: int = 0, rerank: bool = True, threshold: float | None = None, search_parameters=None, batch_size: int = 10000) List[List[Dict]][source]
Search for the k_nearest_neighbors records in the database table based on the similarity of their vectors to the query vectors. Optionally keep only neighbors whose similarity exceeds the threshold.
- Parameters:
query_vectors (np.ndarray) – The query vectors to search with. Shape is (n, d) and dtype is np.float32.
k_nearest_neighbors (int) – Number of nearest neighbors to return.
*col_names (str) – List of columns to use in a neighbor record. Default of None uses all columns.
k_extra_neighbors (int, optional) – Extra neighbors to return from FAISS index before reranking. If using a vector quantizer (e.g., PQ), FAISS orders results based upon the estimated similarities which likely differs from the true similarities calculated here. Default is 0.
rerank (bool, optional) – If True, rerank neighbors according to their true similarities. Otherwise the order is determined by the FAISS index’s index.search(). Default is True.
threshold (float, optional) – Only keep neighbors whose similarities exceed the threshold. Default is None which keeps all neighbors returned.
search_parameters (faiss.SearchParameters, optional) – Use these search parameters instead of the current runtime FAISS parameters. Passed to FAISS’s index.search(). See [FAISS documentation](https://github.com/facebookresearch/faiss/wiki/Setting-search-parameters-for-one-query)
batch_size (int, optional) – Batch size to use when retrieving neighbors information from the database table. Default is 10,000.
- Returns:
For each query, a dictionary containing “neighbors” key whose value is a list of the neighbors’ requested information including the “metric” similarity to the query vector.
- Return type:
List[Dict]
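The interplay of k_extra_neighbors, rerank, and threshold can be sketched in plain Python: fetch more candidates than needed, rescore them with the exact similarity, drop those that do not exceed the threshold, and keep the best k. This is an illustration under stated assumptions, not VekterDB's code. The `rerank` and `inner_product` helpers are hypothetical, only the inner product metric is shown, and the candidate list stands in for whatever an approximate index would return.

```python
def inner_product(a, b):
    """Exact inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def rerank(query, candidates, k, threshold=None):
    """Rescore candidates exactly, filter by threshold, keep the best k.

    candidates is a list of (idx, vector) pairs, for example the
    k + k_extra results returned by an approximate index.
    """
    scored = [(idx, inner_product(query, vec)) for idx, vec in candidates]
    if threshold is not None:
        scored = [(idx, s) for idx, s in scored if s > threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


query = [1.0, 0.0]
# Order as an approximate index might return it: idx 7 first, though 3 is closer.
candidates = [(7, [0.5, 0.5]), (3, [1.0, 0.0]), (9, [0.25, 1.0])]
top = rerank(query, candidates, k=2, threshold=0.3)
```

Asking for one extra candidate here is what lets idx 3 displace idx 7 at the top once the exact similarities are computed.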
- select(where_clause: ClauseElement, *ret_cols: str) Iterator[Dict][source]
Select records from the database table. Class member Record is the ORM-mapped class for the database table and should be used to construct the clause.

    where = vekter_db.Record.idx == 0
    where = vekter_db.Record.idx.in_([100, 200, 300])
    where = sa.sql.and_(vekter_db.Record.idx >= 0, vekter_db.Record.idx < 5)
    vekter_db.select(where)  # Return all columns
    vekter_db.select(where, "idx", "vector")  # Return idx & vector only
- Parameters:
where_clause (sa.sql.ClauseElement) – SQLAlchemy Where Clause.
*ret_cols (str, optional) – List columns to return from the database table. Default of None returns all columns.
- Yields:
Iterator[Dict] – Dictionary of requested fields that match the where clause
- static serialize_vector(vector: ndarray, level=1) bytes[source]
Static method to serialize numpy vector to Python bytes. Uses Zstandard to compress the bytes.
- Parameters:
vector (np.ndarray) – 1-d numpy array of type np.float32
level (int) – Zstandard compression level [1, 22]. Default is 1 (fastest)
- Return type:
bytes
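A serialize and deserialize round trip can be sketched with the standard library alone. This is only an illustration of the idea (float32 bytes, then compression). It uses array in place of numpy and zlib as a stand-in compressor because both ship with Python, and the function names `serialize` and `deserialize` are hypothetical.

```python
import zlib
from array import array


def serialize(values, level=1):
    """Pack floats as float32 bytes, then compress (zlib stand-in)."""
    return zlib.compress(array("f", values).tobytes(), level)


def deserialize(blob):
    """Decompress and unpack float32 bytes back into a list of floats."""
    out = array("f")
    out.frombytes(zlib.decompress(blob))
    return out.tolist()


vector = [0.5, -1.25, 2.0, 0.0]
blob = serialize(vector)
round_trip = deserialize(blob)
```

The values chosen are exactly representable in float32, so the round trip is lossless; arbitrary Python floats would come back rounded to 32-bit precision.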
- set_faiss_runtime_parameters(runtime_params_str: str)[source]
Change FAISS runtime parameters with a human-readable string. Parameters are separated by commas. For example, with the index ‘OPQ64,IVF50000_HNSW32,PQ64’, you can use “nprobe=50,quantizer_efSearch=100” to set both the nprobe in the IVF index and the efSearch in the HNSW quantizer index. If a parameter is not recognized, an exception is thrown.
Saves the provided settings in self.faiss_runtime_parameters.
- Parameters:
runtime_params_str (str) – Comma-separated list of parameters to set. For more details, see https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning#parameterspace-as-a-way-to-set-parameters-on-an-opaque-index
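To show the shape of the parameter string, the sketch below splits it into name and value pairs. FAISS interprets the string itself, so this hypothetical `parse_runtime_params` helper is not how the parameters get applied; it only illustrates the comma-separated `name=value` format and the idea of rejecting malformed input.

```python
def parse_runtime_params(runtime_params_str):
    """Split 'name=value,name=value' into a dict of name -> float."""
    params = {}
    for item in runtime_params_str.split(","):
        name, sep, value = item.strip().partition("=")
        if not sep or not name or not value:
            raise ValueError(f"Malformed parameter: {item!r}")
        params[name] = float(value)
    return params


parsed = parse_runtime_params("nprobe=50,quantizer_efSearch=100")
```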
- similarity(v1: ndarray, v2: ndarray, threshold: float | None = None) float[source]
Calculate the similarity between two vectors using the metric specified in create_index(). Currently only the inner product and L2 are supported. If the similarity fails to meet the threshold, None is returned.
- Parameters:
v1 (np.ndarray) –
v2 (np.ndarray) –
threshold (float, optional) – Only return the value if similarity equals or exceeds this value. Default is None.
- Returns:
similarity of v1 & v2
- Return type:
float
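The threshold behavior can be illustrated with a plain-Python inner product. This sketch covers the inner product metric only, because the description does not say how a threshold applies to L2 distances. The function below is a hypothetical stand-in that works on lists rather than numpy arrays.

```python
def similarity(v1, v2, threshold=None):
    """Inner product of v1 and v2, or None if it falls below threshold."""
    if len(v1) != len(v2):
        raise ValueError("Vectors must have the same dimensionality")
    score = sum(a * b for a, b in zip(v1, v2))
    if threshold is not None and score < threshold:
        return None
    return score


# The inner product here is 1.0 * 0.5 + 2.0 * 0.25 = 1.0.
kept = similarity([1.0, 2.0], [0.5, 0.25], threshold=1.0)
dropped = similarity([1.0, 2.0], [0.5, 0.25], threshold=1.5)
```

A score equal to the threshold is kept, matching the "equals or exceeds" wording of the parameter description.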
- sync_index_to_db(batch_size: int = 100000, faiss_runtime_params: str | None = None) int[source]
Add any vectors from the database that are not in the FAISS index.
- Parameters:
batch_size (int, optional) – Add vectors in batches of this size. Default is 100_000.
faiss_runtime_params (str, optional) – Set FAISS index runtime parameters before adding vectors. Useful if using a quantizer index (e.g., IVF50000_HNSW32) since index.add() searches against the quantizer index. E.g., “quantizer_efSearch=40”. self.faiss_runtime_parameters is set back to its value before function invocation. Default is None.
- Returns:
Number of records added into FAISS index
- Return type:
int
- Raises:
IndexError – Primary key of the records to be added must have consecutive integer values
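The consecutive-key requirement exists because vectors appended to a FAISS index with index.add() receive sequential positions, so the database idx of each new record has to line up with the next free position. A minimal sketch of such a check follows. The `check_consecutive` helper is hypothetical, and whether VekterDB validates keys in exactly this way is an assumption.

```python
def check_consecutive(idx_values, n_indexed):
    """Return how many ids can be appended after n_indexed existing vectors.

    Raises IndexError if the ids are not n_indexed, n_indexed + 1, and so on.
    """
    for offset, idx in enumerate(idx_values):
        expected = n_indexed + offset
        if idx != expected:
            raise IndexError(f"Expected idx {expected}, found {idx}")
    return len(idx_values)


# Five vectors already indexed (ids 0..4); the next ids must start at 5.
n_added = check_consecutive([5, 6, 7], n_indexed=5)

try:
    check_consecutive([5, 7], n_indexed=5)
    gap_detected = False
except IndexError:
    gap_detected = True
```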
- train_index(sample_size: int = 0, batch_size: int = 50000, save_to_disk: bool = True)[source]
Pull vectors from database table to train the FAISS index.
- Parameters:
sample_size (int, optional) – Number of training vectors. If 0, uses all vectors. Default is 0.
batch_size (int, optional) – Passed to sample_vectors() to pull vectors in batches of this size. Default is 50,000.
save_to_disk (bool, optional) – Save trained index to self.faiss_index on disk. Default is True.