Pocket Dimension
================

The Basics
----------

Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors from potentially very high dimensions (~2.1 billion) down to much lower-dimensional dense vectors (~256). It does this by not storing the entire (2.1 billion x 256) projection matrix in memory. Instead, it calculates only the columns it needs on the fly, based on the non-zero elements of any given sparse vector. This function is implemented in Numba to speed things up.

Besides the random projection function, Pocket Dimension comes with two classes that turn records into dense vectors weighted by either Term-Frequency (TF) or Term-Frequency, Inverse-Document-Frequency (TF-IDF).

Usage
-----

::

    from pocket_dimension.vectorizer import TFVectorizer

    # Make some data. "one" and "two" should be similar & "abc" should be different
    records = [
        {"id": "one", "features": [b"one", b"two", b"three"], "counts": [1, 2, 3]},
        {"id": "two", "features": [b"one", b"two", b"three"], "counts": [2, 3, 4]},
        {"id": "abc", "features": [b"abc", b"cde", b"efghi"], "counts": [9, 1, 3]},
    ]

    # Create the TFVectorizer. Let's project down to 128-d
    embedder = TFVectorizer(128)
    X, ids = embedder(records)

    # Let's check cosine similarity
    cosine_one_two = X[0].dot(X[1])
    cosine_one_abc = X[0].dot(X[2])
    print(f"Vectors 'one' and 'two' have cosine similarity = {cosine_one_two:.4f}")
    print(f"Vectors 'one' and 'abc' have cosine similarity = {cosine_one_abc:.4f}")
    # Vectors 'one' and 'two' have cosine similarity = 0.9926
    # Vectors 'one' and 'abc' have cosine similarity = -0.0177

The Details
-----------

Random projection is a dimension-reduction technique that comes with mathematical guarantees thanks to the Johnson-Lindenstrauss Lemma, though in practice it is common to get good results even when projecting to dimensions lower than the lemma guarantees. There are two implementations in ``sklearn.random_projection``: ``GaussianRandomProjection`` and ``SparseRandomProjection``. ``GaussianRandomProjection`` builds a dense projection matrix, which quickly exhausts RAM when the input dimension is large. ``SparseRandomProjection`` uses a sparse matrix instead, but many of its entries are necessarily zero, which can degrade the quality of the projection. Pocket Dimension combines the two approaches: it generates only the columns of the projection matrix it needs, but fills in every row of each generated column.

Pocket Dimension leverages a hash function to create random, but repeatable, :math:`\pm1/\sqrt{d}` entries, where :math:`d` is the embedding dimension; the entries are scaled appropriately to preserve the magnitude of the vector. If your vectors are sparse, the projection is quick to calculate, and the sparser the vector, the faster the processing. (A toy sketch of this column-on-the-fly idea appears at the end of this section.)

The Johnson-Lindenstrauss Lemma states that the squared distance between any pair of vectors, after they have been randomly projected down to a smaller dimension, will be within a factor of :math:`1\pm\epsilon` of the original squared distance:

.. math::

    (1-\epsilon)\lVert \mathbf{u}-\mathbf{v} \rVert^2 \leq \lVert \mathbf{A}\mathbf{u}-\mathbf{A}\mathbf{v} \rVert^2 \leq (1+\epsilon)\lVert \mathbf{u}-\mathbf{v} \rVert^2

where :math:`\mathbf{A}` is the random projection matrix.
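
To see the lemma in action, here is a quick numerical check using an ordinary dense Gaussian projection in plain NumPy and SciPy. This illustrates the lemma itself, not Pocket Dimension's implementation, and the sizes and seed here are arbitrary::

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(42)
    n, D, d = 100, 10_000, 256

    X = rng.normal(size=(n, D))               # n points in the original D-dim space
    A = rng.normal(size=(d, D)) / np.sqrt(d)  # dense Gaussian projection matrix
    Y = X @ A.T                               # the same points in d dimensions

    # Ratio of squared pairwise distances after projection vs. before.
    ratios = pdist(Y, "sqeuclidean") / pdist(X, "sqeuclidean")
    print(f"min ratio = {ratios.min():.3f}, max ratio = {ratios.max():.3f}")
    # Both ratios land close to 1, i.e. inside the 1 +/- epsilon band

Shrinking ``d`` widens the spread of these ratios, which is exactly the :math:`\epsilon` trade-off in the bound above.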
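
Finally, for intuition about the column-on-the-fly trick, here is a minimal sketch in pure Python and NumPy. The CRC32 seeding and the per-feature loop are assumptions made for this sketch only; the real library uses its own hash function inside a Numba-compiled kernel::

    import zlib

    import numpy as np

    def project_sparse(features, counts, d=128):
        """Toy projection of a sparse vector, given as parallel lists of
        feature bytes and counts, down to a dense d-dimensional vector."""
        out = np.zeros(d)
        for feature, count in zip(features, counts):
            # A stable hash of the feature seeds the RNG, so the same feature
            # always regenerates the same +/-1/sqrt(d) column -- the full
            # projection matrix is never stored anywhere.
            rng = np.random.default_rng(zlib.crc32(feature))
            column = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)
            out += count * column
        return out

    v = project_sparse([b"one", b"two", b"three"], [1, 2, 3])
    v /= np.linalg.norm(v)  # unit-normalize if you want cosine similarities

Each generated column has unit norm (:math:`d` entries of magnitude :math:`1/\sqrt{d}`), and columns for different features are nearly orthogonal, which is what preserves vector magnitudes. Memory use depends only on the number of non-zero features, never on the ~2.1 billion possible input dimensions.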