pocket_dimension.random_projection

Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors and then applies this to Term Frequency (TF) and Term Frequency, Inverse Document Frequency (TFIDF) data.

Copyright (C) 2022 Matthew Hendrey & Brendan Murphy

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class pocket_dimension.random_projection.JustInTimeRandomProjection(n_components='auto', *, eps=0.1)[source]

Bases: sklearn.random_projection.BaseRandomProjection

Reduce the dimensionality of sparse vectors using a dense random projection matrix in a memory-efficient way.

This is an alternative to sklearn.random_projection.SparseRandomProjection. This implementation provides a higher quality embedding and uses a constant amount of RAM, irrespective of the sparse starting dimension. It achieves this by not storing the projection matrix in memory; instead it generates only the needed columns just-in-time for any given input vector. The result is equivalent to a SparseRandomProjection with density = 1.0.

The elements of the projection matrix, \(\pm 1 / \sqrt{n\_components}\), are generated as needed using a hashing function to provide a random choice of the sign. This uses numba functions to enable faster computation and is parallelized across the available cpus.

This allows you to use a sparse starting dimension of 2**31 - 1 without any issues. For typical values of n_components, transforming can also be faster than SparseRandomProjection.
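To illustrate the idea, here is a minimal, hypothetical sketch of how one column of signs could be generated on the fly from a hash of the sparse column index. This is illustrative only; the library's numba implementation generates its random bits in batches of 64 and its exact layout may differ.

import numpy as np

def column_signs(col_index: int, n_components: int, seed: int = 0) -> np.ndarray:
    # Illustrative only: derive the +/- 1/sqrt(n_components) entries for one
    # column of the projection matrix from hashed random bits
    scale = 1.0 / np.sqrt(n_components)
    rng = np.random.default_rng(hash((seed, col_index)) & 0xFFFFFFFF)
    bits = rng.integers(0, 2, size=n_components)  # one random bit per output dimension
    return np.where(bits == 1, scale, -scale).astype(np.float32)

# A sparse input row only touches the columns of its nonzero indices,
# so only those columns of the projection matrix ever need to be generated.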

Parameters
  • n_components (int or 'auto', default = 'auto') –

    Dimensionality of the target projection space. If not a multiple of 64, then it will be rounded down to the nearest multiple of 64.

    If ‘auto’, then it will pick a dimensionality that satisfies the Johnson-Lindenstrauss Lemma based upon the eps parameter. The dimensionality will then be rounded up to the nearest multiple of 64 to ensure you satisfy the conditions in the lemma.

    NOTE This can yield a very conservative estimate of the required dimensionality.

  • eps (float, default = 0.1) – Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss Lemma. This is used if n_components is ‘auto’. This represents the error rate between distances in the sparse dimension and the resulting lower embedding dimension. Smaller values of eps give larger values of n_components.

n_components_

Dimensionality of the embedding dimension. It will be a multiple of 64.

Type

int

Examples

import numpy as np
from pocket_dimension.random_projection import (
    JustInTimeRandomProjection,
    random_sparse_vectors,
)

n_components = 256
sparse_dim = 1_073_741_824
n_samples = 100_000
X = random_sparse_vectors(n_samples, sparse_dim=sparse_dim, normalize=True)

transformer = JustInTimeRandomProjection(n_components)
# No need to fit if you provide a value for n_components at initialization
X_new = transformer.transform(X)
X_new.shape
# (100000, 256)
fit(X, y=None)[source]

This is essentially a no-op since there is no need to generate a random projection matrix. Instead, it validates the input data and, if n_components is ‘auto’, sets it according to the Johnson-Lindenstrauss lemma limits based upon X.shape[1] and eps.
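For intuition, here is a rough sketch of how an ‘auto’ dimensionality can be computed with sklearn's standard helper and then rounded up to a multiple of 64, as described above. The point count used here is hypothetical and the class's internal computation may differ.

from sklearn.random_projection import johnson_lindenstrauss_min_dim

eps = 0.1
min_dim = johnson_lindenstrauss_min_dim(n_samples=100_000, eps=eps)  # JL lower bound
n_components = ((min_dim + 63) // 64) * 64  # round up to a multiple of 64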

Parameters
  • X (csr_matrix) – Only the shape is used to find the optimal value of n_components to satisfy Johnson-Lindenstrauss lemma theoretical guarantees.

  • y (Ignored) –

Returns

self – JustInTimeRandomProjection class instance

Return type

object

transform(X: scipy.sparse._csr.csr_matrix) → numpy.ndarray[source]

Project the data using the just-in-time generated dense random projection matrix.

Parameters

X (csr_matrix) – The input data to project into a smaller dimensional space. Shape is (n_samples, sparse_dim)

Returns

Projected data of shape (n_samples, n_components)

Return type

np.ndarray

pocket_dimension.random_projection.distributional_johnson_lindenstrauss_optimal_delta(sparse_dim: int, n_components: int, eps: float) → float[source]

Algorithm to find the optimal failure rate, delta, for a given eps (error rate), sparse_dim, and n_components (embedding dimension) under the Distributional Johnson-Lindenstrauss Lemma, which is a different formulation of the problem. Taken from

M. Skorski. Johnson-Lindenstrauss Transforms with Best Confidence, Proceedings of Machine Learning Research 134, 1 (2021)

which can be found at http://proceedings.mlr.press/v134/skorski21a/skorski21a.pdf

If \(\mathbf{A}\) is the random projection matrix, then \(\delta\) is the probability of exceeding error limits given by

\[\delta = \mathbb{P}\left[ \left| \Vert \mathbf{A}x \Vert^2_2 - \Vert x \Vert^2_2 \right| > \epsilon \Vert x \Vert^2_2 \right]\]
Parameters
  • sparse_dim (int) – Original dimension of the data

  • n_components (int) – Embedding dimension

  • eps (float) – Error rate

Returns

delta – The best probability of failure that you can expect

Return type

float
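A minimal usage sketch of this function:

from pocket_dimension.random_projection import (
    distributional_johnson_lindenstrauss_optimal_delta,
)

# Best achievable failure probability for a 10% error rate when projecting
# from a 2**31 - 1 sparse dimension down to 256 components
delta = distributional_johnson_lindenstrauss_optimal_delta(
    sparse_dim=2**31 - 1, n_components=256, eps=0.1
)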

pocket_dimension.random_projection.random_projection[source]

Randomly project a sparse matrix (standard CSR representation) to dense vectors by effectively multiplying by a random matrix whose elements are \(\pm 1 / \sqrt{d}\). The projection matrix is never stored in memory. Instead, the elements are generated as needed using splitmix64(). Note that the embedding dimension, d, is rounded to a multiple of 64 since random bits are generated in batches of 64 in order to use bit manipulation to speed things up.

The column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and the corresponding values are stored in data[indptr[i]:indptr[i+1]].

Parameters
  • data (np.ndarray, shape=(nnz,), dtype=float32) – The values of the nonzero elements in the sparse csr_matrix where nnz = number of nonzero elements

  • indices (np.ndarray, shape=(nnz,), dtype=int64) – The column indices of the nonzero elements in the sparse csr_matrix

  • indptr (np.ndarray, shape=(n_rows+1,), dtype=int64) – Pointers into data and indices to indicate where the rows start and stop. If you have just a single record, then indptr=[0, len(data)]

  • d (int) – Embedding dimension of dense vectors.

Returns

X – Dense 2-d array containing the randomly projected dense vectors for each row of the input sparse matrix.

Return type

np.ndarray, shape=(n_rows, d), dtype=float32
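A minimal usage sketch that pulls the CSR buffers from a scipy sparse matrix and casts them to the dtypes documented above:

import numpy as np
from scipy.sparse import csr_matrix
from pocket_dimension.random_projection import random_projection

# Two sparse rows in a 2**31 - 1 dimensional space (toy values)
rows = [0, 0, 1]
cols = [7, 1_000_003, 42]
vals = [1.0, 2.0, 3.0]
S = csr_matrix((vals, (rows, cols)), shape=(2, 2**31 - 1), dtype=np.float32)

X = random_projection(
    S.data.astype(np.float32),
    S.indices.astype(np.int64),
    S.indptr.astype(np.int64),
    64,  # embedding dimension, a multiple of 64
)
# X.shape == (2, 64)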

pocket_dimension.random_projection.random_sparse_vectors(n_samples: int, *, sparse_dim: int = 2147483647, min_n_features: int = 20, max_n_features: int = 100, normalize: bool = False, rng: numpy.random._generator.Generator = None) → scipy.sparse._csr.csr_matrix[source]

Randomly generate sparse vectors for testing

Parameters
  • n_samples (int) – Number of sparse vectors to create

  • sparse_dim (int, default = 2**31 - 1) – Dimensionality of the sparse vectors.

  • min_n_features (int, default = 20) – Minimum number of features any given sparse vector may have

  • max_n_features (int, default = 100) – Maximum number of features any given sparse vector may have

  • normalize (bool, default = False) – If true, normalizes the sparse vectors to all have unit length.

  • rng (np.random.Generator, default = None) – A numpy random generator. If None, then one is created

Returns

Shape is (n_samples, sparse_dim)

Return type

csr_matrix

pocket_dimension.random_projection.splitmix64[source]

Fast, simple function that takes an integer and returns a pseudorandom number. This is the mixing function used in Java's SplittableRandom number generator.

Parameters

index (uint64) – Integer to be hashed into a pseudorandom 64-bit value.

Returns

Return type

uint64
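For reference, a plain-Python sketch of the standard splitmix64 mixing function; the library's numba version may differ in details:

def splitmix64_reference(index: int) -> int:
    # Standard splitmix64: mix a 64-bit integer into a pseudorandom 64-bit value
    mask = 0xFFFFFFFFFFFFFFFF
    z = (index + 0x9E3779B97F4A7C15) & mask
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & mask
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & mask
    return z ^ (z >> 31)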

pocket_dimension.vectorizer

Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors and then applies this to Term Frequency (TF) and Term Frequency, Inverse Document Frequency (TFIDF) data.

Copyright (C) 2022 Matthew Hendrey & Brendan Murphy

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class pocket_dimension.vectorizer.TFIDFVectorizer(d: int, cms_file: Union[str, pathlib.Path], *, hash_dim: int = 2147483647, minDF: int = 1, maxDF: int = 4294967295, temperature: float = 1.0, min_n_features: int = 1, min_n_observations: int = 1, filter: str = None, filter_out: bool = True)[source]

Bases: pocket_dimension.vectorizer.TFVectorizer

Randomly project records by first applying a high-dimensional feature hasher to create a sparse vector representation of the tf-idf and then applying the random projection to a dense embedding dimension.

\[tfidf = c^{1/temp} \log_{10}\left(\frac{n\_records}{doc\_freq + 1}\right)\]
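As a worked example of this weighting (the counts are made up): a feature with raw count c = 8, temperature = 2.0, and document frequency 99 in a corpus of 10,000 records gets

import math

c, temp = 8, 2.0
n_records, doc_freq = 10_000, 99
tfidf = c ** (1 / temp) * math.log10(n_records / (doc_freq + 1))
# c**(1/2) is about 2.83 and log10(10000 / 100) = 2.0, so tfidf is about 5.66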
class pocket_dimension.vectorizer.TFVectorizer(d: int, *, cms_file: Union[str, pathlib.Path] = None, hash_dim: int = 2147483647, minDF: int = 1, maxDF: int = 4294967295, temperature: float = 1.0, min_n_features: int = 1, min_n_observations: int = 1, filter: str = None, filter_out: bool = True)[source]

Bases: object

Calculate a Term-Frequency vector representation by first using sklearn’s FeatureHasher to create a high-dimensional, sparse vector representation. Then apply the JustInTimeRandomProjection to return a lower-dimensional dense vector representation. The resulting vectors will be normalized to unit length.
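Conceptually, the pipeline is equivalent to the following sketch built directly from the pieces named above (illustrative only, using str feature names for simplicity; TFVectorizer itself works with bytes features and handles the filtering, reweighting, and normalization for you):

import numpy as np
from sklearn.feature_extraction import FeatureHasher
from pocket_dimension.random_projection import JustInTimeRandomProjection

hasher = FeatureHasher(
    n_features=2**31 - 1, input_type="pair", dtype=np.float32, alternate_sign=False
)
jit_rp = JustInTimeRandomProjection(64)

record = [("cat", 3.0), ("dog", 1.0)]  # one record's (feature, count) pairs
S = hasher.transform([record])         # sparse, shape (1, 2**31 - 1)
X = jit_rp.transform(S)                # dense, shape (1, 64)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize to unit length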

Parameters
  • d (int) – Dimension of the dense vector output. Rounded down to nearest multiple of 64 if not already

  • cms_file (str | Path, optional) – Filename of a saved count-min sketch to use for filtering out features whose document frequency falls outside of [minDF, maxDF]. Default is None

  • hash_dim (int, optional) – Number of dimensions that features get hashed to for the sparse vector representation. Should have hash_dim >> d. Default is the largest possible 2**31 - 1 (~2 billion) for sklearn’s FeatureHasher

  • minDF (int, optional) – Minimum document frequency (number of records with this feature) a feature must have to be included in the embedding. Default is 1

  • maxDF (int, optional) – Maximum document frequency (number of records with this feature) a feature can have to be included in the embedding. Default is 2**32 - 1.

  • temperature (float, optional) – Option to reshape the term frequency vector by raising each count to (1/temperature). Using a value above 1.0 flattens the values relative to each other. Using a value below 1.0 sharpens the contrast of values relative to each other.

  • min_n_features (int) – Must have at least this many features in order to create a vector. Useful as a data quality check. Default is 1

  • min_n_observations (int) – Must have at least this many observations in order to create a vector. Useful as a data quality check. This is the sum of counts over the different features for a given item. Default is 1

  • filter (pybloomfilter.BloomFilter, optional) – Bloom filter to use for filtering features before creating sparse vectors. Default is None

  • filter_out (bool, optional) – If True, then remove features that are in the filter before making the sparse vector. If False, then only use features that are in the filter in the sparse vector. Not used if filter is None. Default is True.

d

Embedding dimension of the dense vector. Must be multiple of 64

Type

int

hash_dim

Dimension of the sparse vector. Must be <= 2**31-1

Type

int

hasher

Hashes input features (bytes) to integer representing corresponding dimension

Type

sklearn.feature_extraction.FeatureHasher

jit_rp

Efficiently projects high-dimensional sparse vectors down to lower dimensional dense vectors

Type

JustInTimeRandomProjection

temperature

Exponentially scale raw counts by raising them to 1 / temperature when making the TF vector

Type

float

min_n_features

Must have at least this many features in order to create a vector

Type

int

min_n_observations

Must have at least this many observations in order to create a vector

Type

int

cms

A count-min sketch that stores document frequency info

Type

sketchnu.CountMinLinear | sketchnu.CountMinLog16 | sketchnu.CountMinLog8

minDF

Minimum document frequency a feature must have to be included in the vector

Type

int

maxDF

Maximum document frequency a feature can have to be included in the vector

Type

int

filter

BloomFilter containing features to be filtered

Type

pybloomfilter.BloomFilter

filter_out

If True, then any feature found in the filter will be excluded from the vector. If False, then only features found in the filter are included in vector.

Type

bool

filter_reweight_features(features: List[bytes], counts: List[int]) → List[Tuple[bytes, float]][source]

Filter the features accordingly and reweight the associated counts. The counts are scaled by raising them to \(1/temperature\) and then multiplied by the idf value. For TFVectorizer \(idf=1.0\).

Features will be filtered out if their document frequency is not within [minDF, maxDF] and if a filter is provided during initialization. Only those features that pass the filters will contribute to the final vector.

After filtering, a final check is done to ensure that there are at least min_n_features features remaining and that there are at least min_n_observations observations of those features. Otherwise an empty list is returned.

Parameters
  • features (List[bytes]) – List of the individual features. These will be hashed by the FeatureHasher to turn into a sparse vector representation

  • counts (List[int]) – Number of times that feature is present in a given record.

Returns

Return type

List[Tuple[bytes, float]]
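A minimal usage sketch (the feature names and counts are made up; with no count-min sketch or bloom filter configured, only the data-quality checks and the temperature reweighting apply):

from pocket_dimension.vectorizer import TFVectorizer

vec = TFVectorizer(64, temperature=2.0)
pairs = vec.filter_reweight_features([b"cat", b"dog"], [4, 1])
# Counts are raised to 1/temperature and idf = 1.0 for TFVectorizer,
# so this should yield roughly [(b"cat", 2.0), (b"dog", 1.0)]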

yield_record(records: Iterable[Dict], ids: List)[source]

Yields a list of an individual record’s (features, values) where the values are the counts after reweighting them by temperature and/or idf.

Parameters
  • records (Iterable[Dict]) – An iterable of records where each records is a dict of {“id”: str, “features”: List[bytes], “counts”: List[int]}

  • ids (List) – Pass in an empty list which will be returned with corresponding ids from the records passed in

Yields

List[Tuple[bytes, float]] – Of the appropriate individual (feature, value) for a given record
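A sketch of the expected record format and iteration (hypothetical data):

from pocket_dimension.vectorizer import TFVectorizer

vec = TFVectorizer(64)
records = [
    {"id": "doc-1", "features": [b"cat", b"dog"], "counts": [4, 1]},
    {"id": "doc-2", "features": [b"fish"], "counts": [2]},
]
ids = []
for feature_values in vec.yield_record(records, ids):
    # feature_values is the list of (feature, value) tuples for one record
    pass
# ids now holds the corresponding "id" fields of the records that were yielded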

pocket_dimension.vectorizer.numba_idf[source]

Numba function to calculate the idf value for a doc_freq

Parameters
  • doc_freq (float32) – Number of records that contain this particular feature.

  • n_records (float32) – Total number of records in the data set.

Returns

Return type

float32
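Based on the TFIDFVectorizer formula above, this presumably computes the following (a sketch, not the library's exact code):

import math

def idf(doc_freq: float, n_records: float) -> float:
    # Inverse document frequency as used in the tfidf weighting above
    return math.log10(n_records / (doc_freq + 1.0))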

tests.test_pocket_dimension

Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors and then applies this to Term Frequency (TF) and Term Frequency, Inverse Document Frequency (TFIDF) data.

Copyright (C) 2022 Matthew Hendrey & Brendan Murphy

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

tests.test_pocket_dimension.sparse_distance_squared(S1: scipy.sparse._csr.csr_matrix, S2: scipy.sparse._csr.csr_matrix) → float[source]

Calculate the distance between two sparse vectors

tests.test_pocket_dimension.test_distributional_johnson_lindenstrauss(n: int = 1000, min_n_features: int = 5, max_n_features: int = 100, eps: float = 0.1, delta_pad: float = 0.01)[source]

Test the statistical guarantees of the Distributional Johnson-Lindenstrauss Lemma using a method for finding the best possible delta (failure rate) for a given epsilon (error rate) of the difference of the L2 norm between vectors in the original dimension (x) and the lower embedding dimension (Ax) described in

M. Skorski. Johnson-Lindenstrauss Transforms with Best Confidence, Proceedings of Machine Learning Research 134, 1 (2021) http://proceedings.mlr.press/v134/skorski21a/skorski21a.pdf

This paper provides the optimal possible error probability

delta(sparse_dim, embed_dim, eps) = P[ abs(|Ax|**2 - |x|**2) > eps * |x|**2 ]

given the best possible matrix A and the worst possible data x.

Parameters
  • n (int, optional) – Number of data points. Default is 1,000

  • min_n_features (int, optional) – Minimum number of features any given vector may have. Default is 5

  • max_n_features (int, optional) – Maximum number of features any given vector may have. Default is 100

  • eps (float, optional) – Error rate. Default is 0.1

  • delta_pad (float, optional) – Amount by which we pad the calculated best delta value. This is done to reduce the probability of failure given the vagaries of running statistical tests. Default is 0.01

tests.test_pocket_dimension.test_johnson_lindenstrauss(n: int = 40, min_n_features: int = 5, max_n_features: int = 100, eps: float = 0.05)[source]

Test that the Johnson-Lindenstrauss Lemma holds between the sparse vectors and the randomly projected dense vectors. This uses the results found in

https://cs.stanford.edu/people/mmahoney/cs369m/Lectures/lecture1.pdf

This checks that the ratio of the distance between two vectors in the embedding to the distance between the original sparse vectors is between (1 - eps) and (1 + eps). The embedding dimension is determined by the requested error rate, eps.

d = 24 * log(n) / (3 * eps**2 - 2 * eps ** 3)

where n is the number of vectors. This does the full n(n-1)/2 comparisons.
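For example, the bound above can be evaluated directly for the default test settings (a quick sanity check, not part of the test code):

import math

n, eps = 40, 0.05
d = 24 * math.log(n) / (3 * eps**2 - 2 * eps**3)
# d is about 12,212, so a small eps requires a fairly large embedding dimension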

Parameters
  • n (int, optional) – Number of sparse vectors to randomly generate. Default is 40

  • min_n_features (int, optional) – Minimum number of features any given sparse vector may have. Default is 5

  • max_n_features (int, optional) – Maximum number of features any given sparse vector may have. Default is 100

  • eps (float, optional) – Desired error rate. Default is 0.05

tests.test_pocket_dimension.test_numba_idf()[source]

Test that the numba function numba_idf works as expected

tests.test_pocket_dimension.test_tf(d: int = 64)[source]

Basic test that things look right.

Parameters

d (int, optional) – Embedding dimension. Default is 64

tests.test_pocket_dimension.test_tf_cms(tmp_path, d: int = 64)[source]

Test that using a count-min sketch for filtering works properly

tests.test_pocket_dimension.test_tf_data_quality(d: int = 64)[source]

Test that records without minimum data quality (min_n_features & min_n_observations) return vectors

Parameters

d (int, optional) – Embedding dimension. Default is 64

tests.test_pocket_dimension.test_tf_filter(tmp_path, d: int = 64)[source]

Testing the bloom filter & filter_out parameters for the TFVectorizer

tests.test_pocket_dimension.test_tf_temp(d: int = 64)[source]

Test that changing the temperature used during embedding changes the cosine in an expected way

tests.test_pocket_dimension.test_tfidf(tmp_path, d: int = 64)[source]

Test that a tfidf vectorization causes one feature’s tfidf to be zero and match a vector that does not have this feature

tests.test_pocket_dimension.test_tfidf_data_quality(tmp_path, d: int = 64)[source]

Test that records without minimum data quality (min_n_features & min_n_observations) return vectors

Parameters
  • tmp_path (str) – Local temp directory to save count-min sketch to

  • d (int, optional) – Embedding dimension. Default is 64