pocket_dimension.random_projection¶
Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors and then applies this to Term Frequency (TF) and Term Frequency, Inverse Document Frequency (TFIDF) data.
Copyright (C) 2022 Matthew Hendrey & Brendan Murphy
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
class pocket_dimension.random_projection.JustInTimeRandomProjection(n_components='auto', *, eps=0.1)[source]¶

Bases: sklearn.random_projection.BaseRandomProjection
Reduce the dimensionality of sparse vectors using a dense random projection matrix in a memory-efficient way.
This is an alternative to sklearn.random_projection.SparseRandomProjection. This implementation provides a higher quality embedding and uses a constant amount of RAM, irrespective of the sparse starting dimension. It achieves this by not storing the projection matrix in memory, but instead generating only the needed columns just-in-time for any given input vector. This makes it equivalent to a SparseRandomProjection with density = 1.0.
The elements of the projection matrix, \(\pm 1 / \sqrt{n\_components}\), are generated as needed using a hashing function to provide a random choice of the sign. This uses numba functions to enable faster computation and is parallelized across the available CPUs.
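As a rough illustration of this idea (a minimal sketch, not the package's numba implementation, which batches 64 random bits at a time and runs in parallel), the sign of each projection matrix element can be derived from a hash of the column index, so the matrix never needs to be materialized. The splitmix64-style mixer and project_row helper below are hypothetical names for illustration only.

import numpy as np

def splitmix64(x: int) -> int:
    # Simple 64-bit mixer (sketch); the hash used by pocket_dimension may differ
    x = (x + 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    return x ^ (x >> 31)

def project_row(cols, vals, n_components=64):
    # Just-in-time projection of one sparse row: for each nonzero column,
    # hash the column index to get 64 random bits, interpret each bit as a
    # +/-1 sign, and accumulate value * sign / sqrt(n_components).
    assert n_components == 64, "sketch handles a single 64-bit block"
    out = np.zeros(n_components, dtype=np.float32)
    scale = 1.0 / np.sqrt(n_components)
    for col, val in zip(cols, vals):
        bits = splitmix64(int(col))
        for k in range(n_components):
            sign = 1.0 if (bits >> k) & 1 else -1.0
            out[k] += val * sign * scale
    return out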
This allows you to use a sparse starting dimension of up to 2**31 - 1 without any issues. Depending on the value of n_components, transforming can also be faster than SparseRandomProjection.

- Parameters
n_components (int or 'auto', default = 'auto') – Dimensionality of the target projection space. If not a multiple of 64, then it will be rounded down to the nearest multiple of 64. If 'auto', then it will pick a dimensionality that satisfies the Johnson-Lindenstrauss Lemma based upon the eps parameter. The dimensionality will then be rounded up to the nearest multiple of 64 to ensure you satisfy the conditions in the lemma. NOTE: This can yield a very conservative estimate of the required dimensionality.
eps (float, default = 0.1) – Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss Lemma. This is used if n_components is 'auto'. It represents the error rate between distances in the sparse dimension and the resulting lower embedding dimension. Smaller values of eps give larger values of n_components.
n_components_¶
Dimensionality of the embedding space. It will be a multiple of 64.
- Type
int
Examples

import numpy as np
from pocket_dimension.random_projection import (
    JustInTimeRandomProjection,
    random_sparse_vectors,
)

n_components = 256
sparse_dim = 1_073_741_824
n_samples = 100_000

X = random_sparse_vectors(n_samples, sparse_dim=sparse_dim, normalize=True)
transformer = JustInTimeRandomProjection(n_components)
# No need to fit if you provide a value for n_components at initialization
X_new = transformer.transform(X)
X_new.shape
# (100000, 256)
fit(X, y=None)[source]¶

This is essentially a no-op since there is no need to generate a random projection matrix. Instead, it validates the input data and, if n_components is 'auto', sets it according to the Johnson-Lindenstrauss lemma limits based upon X.shape[1] and eps.

- Parameters
X (csr_matrix) – Only the shape is used to find the optimal value of n_components to satisfy the Johnson-Lindenstrauss lemma theoretical guarantees.
y (Ignored) –
- Returns
self – JustInTimeRandomProjection class instance
- Return type
object
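For example (a usage sketch; the eps value and data shapes are illustrative), fitting with n_components='auto' lets the lemma pick the embedding dimension:

from pocket_dimension.random_projection import (
    JustInTimeRandomProjection,
    random_sparse_vectors,
)

X = random_sparse_vectors(1_000, sparse_dim=2**31 - 1, normalize=True)
transformer = JustInTimeRandomProjection(n_components="auto", eps=0.25)
transformer.fit(X)                 # sets n_components_ from X.shape[1] and eps
print(transformer.n_components_)   # a multiple of 64
X_new = transformer.transform(X)   # shape (1000, transformer.n_components_)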
transform(X: scipy.sparse._csr.csr_matrix) → numpy.ndarray[source]¶

Project the data using the just-in-time generated dense random projection matrix.
- Parameters
X (csr_matrix) – The input data to project into a smaller dimensional space. Shape is (n_samples, sparse_dim)
- Returns
Projected data of shape (n_samples, n_components)
- Return type
np.ndarray
pocket_dimension.random_projection.distributional_johnson_lindenstrauss_optimal_delta(sparse_dim: int, n_components: int, eps: float) → float[source]¶

Algorithm to find the optimal failure rate, delta, for a given eps (error rate), sparse_dim, and n_components (embedding dimension) for the Distributional Johnson-Lindenstrauss Lemma, which is a different formulation of the problem. Taken from

M. Skorski. Johnson-Lindenstrauss Transforms with Best Confidence, Proceedings of Machine Learning Research 134, 1 (2021)

which can be found at http://proceedings.mlr.press/v134/skorski21a/skorski21a.pdf
If \(\mathbf{A}\) is the random projection matrix, then \(\delta\) is the probability of exceeding the error limits, given by

\[\delta = \mathbb{P} \left[ \left| \Vert \mathbf{A}x \Vert^2_2 - \Vert x \Vert^2_2 \right| > \epsilon \Vert x \Vert^2_2 \right]\]

- Parameters
sparse_dim (int) – Original dimension of the data
n_components (int) – Embedding dimension
eps (float) – Error rate
- Returns
delta – The best probability of failure that you can expect
- Return type
float
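For example (illustrative numbers only), you can check the best failure probability you could hope for at a given embedding size before committing to it:

from pocket_dimension.random_projection import (
    distributional_johnson_lindenstrauss_optimal_delta,
)

# Best achievable probability that a squared L2 norm is distorted by more
# than 10% when projecting from 2**31 - 1 dimensions down to 256
delta = distributional_johnson_lindenstrauss_optimal_delta(
    sparse_dim=2**31 - 1, n_components=256, eps=0.1
)
print(delta)  # smaller is better: distances are preserved more reliably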
pocket_dimension.random_projection.random_projection[source]¶

Randomly project a sparse matrix (standard CSR representation) to dense vectors by effectively multiplying by a random matrix whose elements are \(\pm 1 / \sqrt{d}\). The projection matrix is never stored in memory. Instead, the elements are generated as needed using splitmix64(). Note: the embedding dimension, d, is rounded to a multiple of 64 since random bits are generated in batches of 64 in order to utilize bit manipulation to speed things up.

The column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and the corresponding values are stored in data[indptr[i]:indptr[i+1]].
- Parameters
data (np.ndarray, shape=(nnz,), dtype=float32) – The values of the nonzero elements in the sparse csr_matrix where nnz = number of nonzero elements
indices (np.ndarray, shape=(nnz,), dtype=int64) – The column indices of the nonzero elements in the sparse csr_matrix
indptr (np.ndarray, shape=(n_rows+1,), dtype=int64) – Pointers into data and indices to indicate where the rows start and stop. If you have just a single record, then indptr=[0, len(data)]
d (int) – Embedding dimension of dense vectors.
- Returns
X – Dense 2-d array containing the randomly projected dense vectors for each row of the input sparse matrix.
- Return type
np.ndarray, shape=(n_rows, d), dtype=float32
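A usage sketch, assuming the argument order follows the parameter list above (data, indices, indptr, d); the casts simply enforce the documented dtypes:

import numpy as np
from pocket_dimension.random_projection import (
    random_projection,
    random_sparse_vectors,
)

X = random_sparse_vectors(1_000, sparse_dim=2**31 - 1, normalize=True)
X_dense = random_projection(
    X.data.astype(np.float32),
    X.indices.astype(np.int64),
    X.indptr.astype(np.int64),
    256,  # d; already a multiple of 64
)
print(X_dense.shape, X_dense.dtype)  # (1000, 256) float32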
pocket_dimension.random_projection.random_sparse_vectors(n_samples: int, *, sparse_dim: int = 2147483647, min_n_features: int = 20, max_n_features: int = 100, normalize: bool = False, rng: numpy.random._generator.Generator = None) → scipy.sparse._csr.csr_matrix[source]¶

Randomly generate sparse vectors for testing
- Parameters
n_samples (int) – Number of sparse vectors to create
sparse_dim (int, default = 2**31 - 1) – Dimensionality of the sparse vectors.
min_n_features (int, default = 20) – Minimum number of features any given sparse vector may have
max_n_features (int, default = 100) – Maximum number of features any given sparse vector may have
normalize (bool, default = False) – If true, normalizes the sparse vectors to all have unit length.
rng (np.random.Generator, default = None) – A numpy random generator. If None, then one is created
- Returns
Shape is (n_samples, sparse_dim)
- Return type
csr_matrix
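A quick sketch of what comes back (a small sparse_dim is used here only to keep the example light; the defaults in the signature above apply otherwise):

import numpy as np
from pocket_dimension.random_projection import random_sparse_vectors

rng = np.random.default_rng(8675309)
X = random_sparse_vectors(5, sparse_dim=1_000_000, normalize=True, rng=rng)
print(X.shape)                    # (5, 1000000)
print(np.diff(X.indptr))          # nonzeros per row, bounded by min/max n_features
print(X.multiply(X).sum(axis=1))  # ~1.0 per row since normalize=True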
pocket_dimension.vectorizer¶
Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors and then applies this to Term Frequency (TF) and Term Frequency, Inverse Document Frequency (TFIDF) data.
class pocket_dimension.vectorizer.TFIDFVectorizer(d: int, cms_file: Union[str, pathlib.Path], *, hash_dim: int = 2147483647, minDF: int = 1, maxDF: int = 4294967295, temperature: float = 1.0, min_n_features: int = 1, min_n_observations: int = 1, filter: str = None, filter_out: bool = True)[source]¶

Bases: pocket_dimension.vectorizer.TFVectorizer
Randomly project records by first applying a high-dimensional feature hasher to create a sparse vector representation of the tf-idf and then applying the random projection to a dense embedding dimension.
\[tfidf = c^{1/temp} \log_{10}(n\_records / (doc\_freq + 1))\]
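As a worked illustration of the formula above (plain arithmetic with made-up numbers, not an API call): a feature counted c = 8 times in a record, with temperature = 2.0 and a document frequency of 99 out of n_records = 100,000, gets the following weight before unit-length normalization.

import math

c = 8            # raw count of the feature in this record
temp = 2.0       # temperature
doc_freq = 99    # number of records containing this feature
n_records = 100_000

tf = c ** (1.0 / temp)                        # 8**0.5 ~ 2.83
idf = math.log10(n_records / (doc_freq + 1))  # log10(1000) = 3.0
print(tf * idf)                               # ~8.49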
class pocket_dimension.vectorizer.TFVectorizer(d: int, *, cms_file: Union[str, pathlib.Path] = None, hash_dim: int = 2147483647, minDF: int = 1, maxDF: int = 4294967295, temperature: float = 1.0, min_n_features: int = 1, min_n_observations: int = 1, filter: str = None, filter_out: bool = True)[source]¶

Bases: object
Calculate a Term-Frequency vector representation by first using sklearn’s FeatureHasher to create a high-dimensional, sparse vector representation. Then apply the JustInTimeRandomProjection to return a lower-dimensional dense vector representation. The resulting vectors will be normalized to unit length.
- Parameters
d (int) – Dimension of the dense vector output. Rounded down to the nearest multiple of 64 if not already a multiple of 64
cms_file (str | Path, optional) – Filename of a saved count-min sketch to use for filtering out features whose document frequency falls outside of [minDF, maxDF]. Default is None
hash_dim (int, optional) – Number of dimensions that features get hashed to for the sparse vector representation. Should have hash_dim >> d. Default is the largest possible 2**31 - 1 (~2 billion) for sklearn's FeatureHasher
minDF (int, optional) – Minimum document frequency (number of records with this feature) a feature must have to be included in the embedding. Default is 1
maxDF (int, optional) – Maximum document frequency (number of records with this feature) a feature can have to be included in the embedding. Default is 2**32 - 1.
temperature (float, optional) – Option to reshape the term frequency vector by raising each count to (1/temperature). Using a value above 1.0 flattens the values relative to each other. Using a value below 1.0 sharpens the contrast of values relative to each other (see the sketch after this parameter list).
min_n_features (int) – Must have at least this many features in order to create a vector. Useful as a data quality check. Default is 1
min_n_observations (int) – Must have at least this many observations in order to create a vector. Useful as a data quality check. This is the sum of counts over the different features for a given item. Default is 1
filter (pybloomfilter.BloomFilter, optional) – Bloom filter to use for filtering features before creating sparse vectors. Default is None
filter_out (bool, optional) – If True, then remove features that are in the filter before making the sparse vector. If False, then only use features that are in the filter in the sparse vector. Not used if filter is None. Default is True.
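A small sketch of the temperature reshaping described above (plain numpy arithmetic, not a library call): temperature > 1 pulls frequent and rare features closer together, while temperature < 1 exaggerates the difference.

import numpy as np

counts = np.array([1.0, 4.0, 16.0, 64.0])

flattened = counts ** (1.0 / 2.0)   # temperature = 2.0 -> [1, 2, 4, 8]
sharpened = counts ** (1.0 / 0.5)   # temperature = 0.5 -> [1, 16, 256, 4096]
unchanged = counts ** (1.0 / 1.0)   # temperature = 1.0 keeps raw counts

print(flattened / np.linalg.norm(flattened))  # relative weights more uniform
print(sharpened / np.linalg.norm(sharpened))  # relative weights more peaked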
d¶
Embedding dimension of the dense vector. Must be a multiple of 64
- Type
int

hash_dim¶
Dimension of the sparse vector. Must be <= 2**31-1
- Type
int

hasher¶
Hashes input features (bytes) to the integer representing the corresponding dimension
- Type
sklearn.feature_extraction.FeatureHasher

jit_rp¶
Efficiently projects high-dimensional sparse vectors down to lower dimensional dense vectors
- Type
pocket_dimension.random_projection.JustInTimeRandomProjection

temperature¶
Raw counts are exponentially scaled by raising them to 1 / temperature when making the TF vector
- Type
float

min_n_features¶
Must have at least this many features in order to create a vector
- Type
int

min_n_observations¶
Must have at least this many observations in order to create a vector
- Type
int

cms¶
A count-min sketch that stores document frequency info
- Type
sketchnu.CountMinLinear | sketchnu.CountMinLog16 | sketchnu.CountMinLog8

minDF¶
Minimum document frequency a feature must have to be included in the vector
- Type
int

maxDF¶
Maximum document frequency a feature can have to be included in the vector
- Type
int

filter¶
BloomFilter containing features to be filtered
- Type
pybloomfilter.BloomFilter

filter_out¶
If True, then any feature found in the filter will be excluded from the vector. If False, then only features found in the filter are included in the vector.
- Type
bool
filter_reweight_features(features: List[bytes], counts: List[int]) → List[Tuple[bytes, float]][source]¶

Filter the features accordingly and reweight the associated counts. The counts are scaled by raising them to \(1/temperature\) and then multiplied by the idf value. For TFVectorizer, \(idf=1.0\).

Features will be filtered out if their document frequency is not within [minDF, maxDF], or by the filter if one was provided during initialization. Only those features that pass the filters will contribute to the final vector.

After filtering, a final check is done to ensure that there are at least min_n_features features remaining and that there are at least min_n_observations observations of those features. Otherwise an empty list is returned.

- Parameters
features (List[bytes]) – List of the individual features. These will be hashed by the FeatureHasher to turn into a sparse vector representation
counts (List[int]) – Number of times that feature is present in a given record.
- Returns
- Return type
List[Tuple[bytes, float]]
yield_record(records: Iterable[Dict], ids: List)[source]¶

Yields a list of an individual record's (features, values) where the values are the counts after reweighting them by temperature and/or idf.

- Parameters
records (Iterable[Dict]) – An iterable of records where each record is a dict of {"id": str, "features": List[bytes], "counts": List[int]}
ids (List) – Pass in an empty list which will be returned with corresponding ids from the records passed in
- Yields
List[Tuple[bytes, float]] – Of the appropriate individual (feature, value) for a given record
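A usage sketch, assuming records in the documented {"id", "features", "counts"} form; d=64 and the feature strings are illustrative.

from pocket_dimension.vectorizer import TFVectorizer

vectorizer = TFVectorizer(d=64)
records = [
    {"id": "rec-1", "features": [b"cat", b"dog", b"fish"], "counts": [3, 1, 2]},
    {"id": "rec-2", "features": [b"dog", b"bird"], "counts": [5, 4]},
]

ids = []  # filled with the corresponding "id" values as records are consumed
for feature_values in vectorizer.yield_record(records, ids):
    # feature_values is a list of (feature_bytes, reweighted_value) tuples
    print(feature_values)
print(ids)  # ["rec-1", "rec-2"]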
tests.test_pocket_dimension¶
Pocket Dimension provides a memory-efficient, dense, random projection of sparse vectors and then applies this to Term Frequency (TF) and Term Frequency, Inverse Document Frequency (TFIDF) data.
tests.test_pocket_dimension.sparse_distance_squared(S1: scipy.sparse._csr.csr_matrix, S2: scipy.sparse._csr.csr_matrix) → float[source]¶

Calculate the squared distance between two sparse vectors
tests.test_pocket_dimension.test_distributional_johnson_lindenstrauss(n: int = 1000, min_n_features: int = 5, max_n_features: int = 100, eps: float = 0.1, delta_pad: float = 0.01)[source]¶

Test the statistical guarantees of the Distributional Johnson-Lindenstrauss Lemma using a method for finding the best possible delta (failure rate) for a given epsilon (error rate) of the difference of the L2 norm between vectors in the original dimension (x) and the lower embedding dimension (Ax), described in

M. Skorski. Johnson-Lindenstrauss Transforms with Best Confidence, Proceedings of Machine Learning Research 134, 1 (2021) http://proceedings.mlr.press/v134/skorski21a/skorski21a.pdf

This paper provides the optimal possible error probability

delta(sparse_dim, embed_dim, eps) = P[abs(|Ax|**2 - |x|**2) > eps * |x|**2]

given the best possible matrix A and the worst possible data x.
- Parameters
n (int, optional) – Number of data points. Default is 1,000
min_n_features (int, optional) – Minimum number of features any given vector may have. Default is 5
max_n_features (int, optional) – Maximum number of features any given vector may have. Default is 100
eps (float, optional) – Error rate. Default is 0.1
delta_pad (float, optional) – Amount by which we pad the calculated best delta value. This is done to reduce the probability of failure given the vagaries of running statistical tests. Default is 0.01
tests.test_pocket_dimension.test_johnson_lindenstrauss(n: int = 40, min_n_features: int = 5, max_n_features: int = 100, eps: float = 0.05)[source]¶

Test that the Johnson-Lindenstrauss Lemma holds between the sparse vectors and the randomly projected dense vectors. This uses the results found in

https://cs.stanford.edu/people/mmahoney/cs369m/Lectures/lecture1.pdf

This checks that the ratio of the distance between two vectors in the embedding to the distance between the original sparse vectors is between (1 - eps) and (1 + eps). The embedding dimension is determined by the requested error rate, eps:

d = 24 * log(n) / (3 * eps**2 - 2 * eps**3)

where n is the number of vectors. This does the full n(n-1)/2 comparisons.
- Parameters
n (int, optional) – Number of sparse vectors to randomly generate. Default is 40
min_n_features (int, optional) – Minimum number of features any given sparse vector may have. Default is 5
max_n_features (int, optional) – Maximum number of features any given sparse vector may have. Default is 100
eps (float, optional) – Desired error rate. Default is 0.05
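For the defaults above, the dimension formula implies a fairly large embedding. A quick check of the arithmetic (assuming the natural log, as in the lecture notes; the base actually used by the test is an assumption):

import math

n, eps = 40, 0.05  # the test's default values
d = 24 * math.log(n) / (3 * eps**2 - 2 * eps**3)
print(d)  # roughly 12,200: even a modest eps forces a large embedding dimension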
tests.test_pocket_dimension.test_numba_idf()[source]¶

Test that the numba function numba_idf works as expected

tests.test_pocket_dimension.test_tf(d: int = 64)[source]¶

Basic test that things look right.
- Parameters
d (int, optional) – Embedding dimension. Default is 64
tests.test_pocket_dimension.test_tf_cms(tmp_path, d: int = 64)[source]¶

Test that using a count-min sketch for filtering works properly

tests.test_pocket_dimension.test_tf_data_quality(d: int = 64)[source]¶

Test that records without minimum data quality (min_n_features & min_n_observations) return vectors
- Parameters
d (int, optional) – Embedding dimension. Default is 64
tests.test_pocket_dimension.test_tf_filter(tmp_path, d: int = 64)[source]¶

Testing the bloom filter & filter_out parameters for the TFVectorizer

tests.test_pocket_dimension.test_tf_temp(d: int = 64)[source]¶

Test that changing the temperature used during embedding changes the cosine similarity in an expected way

tests.test_pocket_dimension.test_tfidf(tmp_path, d: int = 64)[source]¶

Test that a tfidf vectorization causes one feature's tfidf to be zero and matches a vector that does not have this feature

tests.test_pocket_dimension.test_tfidf_data_quality(tmp_path, d: int = 64)[source]¶

Test that records without minimum data quality (min_n_features & min_n_observations) return vectors
- Parameters
tmp_path (str) – Local temp directory to save count-min sketch to
d (int, optional) – Embedding dimension. Default is 64