sketchnu.hyperloglog¶
Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.
Copyright (C) 2022 Matthew Hendrey
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Too many problems with the experimental Numba class. Converting this to a series of Numba functions and a Python class.
Example
To get a HyperLogLog with precision of 16 and seed 0:
from sketchnu.hyperloglog import HyperLogLog
hll = HyperLogLog()
key = 'testing'.encode('utf-8')
hll.add(key)
# To get a estimated cardinality
hll.query()
-
class
sketchnu.hyperloglog.
HyperLogLog
(p: int = 16, seed: int = 0, shared_memory: bool = False)[source]¶ Bases:
object
Implementation of the HyperLogLog++ algorithm which uses a 64-bit hash function and corrects bias when cardinality is low.
- Parameters
p (int) – Precision specifies the number of registers (2**p). The larger the p, the more accurate the estimated cardinality. Must be between [7, 16]
seed (int, optional) – Seed passed to the fasthash64 function. Default is 0
shared_memory (bool, optional) – If True, then HyperLogLog is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.
-
p
¶ Precision of the HyperLogLog. Must be between [7, 16]
- Type
int
-
seed
¶ Seed passed to hash function
- Type
int
-
m
¶ Number of registers. Equal to 2**p
- Type
int
If True, then HyperLogLog is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.
- Type
bool, optional
-
add
(key: bytes, value: int = 1) → None[source]¶ Add a single key to the HyperLogLog.
Note
Currently, the numba.typed.List needed for the update() are quite slow. Speedtest show that you are better off looping through the keys in python and add(key) instead of update(numba.typed.List(keys)). Taking just over twice as long.
- Parameters
key (bytes) – Element to add to the HyperLogLog
value (int, optional) – Number of times to add key to the sketch. Ignored for HyperLogLog but included in order to have the same API as the other sketches. Default is 1
- Returns
- Return type
None
-
add_ngram
(key: bytes, ngram: int) → None[source]¶ Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key
- Parameters
key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size
- Returns
- Return type
None
-
attach_existing_shm
(existing_shm_name: str) → multiprocessing.shared_memory.SharedMemory[source]¶ Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.
- Parameters
existing_shm_name (str) – Name an existing shared memory block to attach this sketch to
- Returns
- Return type
None
-
static
load
(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶ Load a saved HyperLogLog sketch stored in filename
- Parameters
filename (str | Path) – File path to the saved .npz file
shared_memory (bool) – If True, copy registers into shared memory
- Returns
- Return type
-
merge
(other) → None[source]¶ Merge the HyperLogLog other into this one.
- Parameters
other (HyperLogLog) – Another HyperLogLog with the same precision and seed
- Returns
- Return type
None
- Raises
TypeError – If other has different precision or seed from this one
-
query
() → float[source]¶ Return the estimated cardinality
- Returns
Estimated cardinality of elements added to the HyperLogLog so far
- Return type
float
-
save
(filename: Union[str, pathlib.Path]) → None[source]¶ Save the HyperLogLog sketch, hll, to the file, filename
- Parameters
filename (str | Path) – File to save the hll to disk. This will be a .npz file.
- Returns
- Return type
None
-
update
(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]¶ Given a list of keys, update the HyperLogLog. This follows the convention of collections.Counter in that keys should be a list of keys or a dictionary whose keys are the keys. For this sketch, the value of the dictionary is ignored since it doesn’t make sense to put the same key in multiple times.
Note
Currently, the numba.typed.List needed for the update() are quite slow. Speedtest show that you are better off looping through the keys in python and add(key) instead of update(numba.typed.List(keys)). Taking just over twice as long. If repeated adding the same keys, then update is faster.
- Parameters
keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the HyperLogLog at once. If a Dict is passed, then only the Dict.keys() are put into the sketch.
- Returns
- Return type
None
-
update_ngram
(keys: bytes, ngram: int) → None[source]¶ Given a list of keys, split each into ngrams of size ngram, and then add them to the sketch.
Note
Speed tests show that you are slightly better off looping through the keys in Python and calling add_ngram() instead of update_ngram(). Used 50k keys, each of 200 bytes, and ngram=4 to find that update_ngram() was ~ 6% slower. This is significantly better than add() vs. update(). If repeated adding the same keys (for testing), then update_ngram() is just a little bit faster than add_ngram().
- Parameters
keys (List[bytes]) – List of elements to be shingled before adding to the sketch.
ngram (int) – ngram size
- Returns
- Return type
None
sketchnu.countmin¶
Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.
Copyright (C) 2022 Matthew Hendrey
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Too many problems with the experimental Numba class. Converting this to a series of Numba functions and a Python classes.
Implementation of count-min sketch using Numba. Three different types have been coded, but all use conservative updating to help reduce errors. The three types are
linear : Uses 32-bit linear counters
log16 : Uses 16-bit log counters
log8 : Uses 8-bit log counters
Each type is implemented as a Python class, but uses numba functions under the hood
where possible (and faster). The convenience function, CountMin()
is the
recommended way to instantiate a count-min sketch. The Python classes are
CountMinLinear
, CountMinLog16
, and CountMinLog8
.
Example
To get a linear CountMin with width = 2**20
from sketchnu.countmin import CountMin
width = 2**20
cms = CountMin("linear", width)
key = "testing".encode("utf-8")
cms.add(key)
# To get an estimated count for that key use either
cms.query(key)
cms[key]
Each count-min sketch implementation has two special counters that are stored in the
attribute n_added_records
which is a 1-d array with two elements. The first
records the total number of elements added to the count-min sketch. This is useful when
calculating the error guarantees. Its value can be retrieved with n_added()
.
The second element is used by helpers.parallel_add()
to store the number of
records added to the sketch. This value is needed when using a count-min sketch to
calculate the tf-idf value for a given record. The idf piece needs the number of
records which have a given key, i.e. document frequency. This can be provided directly
by the count-min sketch. But you also need to know the total number of records. This
value can be retrieved by n_records()
if you use helpers.parallel_add()
to create the count-min sketch.
-
sketchnu.countmin.
CountMin
(cms_type: str, width: int, depth: int = 8, max_count: int = 4294967295, num_reserved: int = None, shared_memory: bool = False) → Union[sketchnu.countmin.CountMinLinear, sketchnu.countmin.CountMinLog16, sketchnu.countmin.CountMinLog8][source]¶ Convenience function to instantiate a count-min sketch of the given type.
- Parameters
cms_type (str) – Must be ‘linear’ | ‘log16’ | ‘log8’
width (int) – Width of the count-min sketch. Best if you keep width >= n_unique/2
depth (int, optional) – Depth of the count-min sketch. Sets the number of different hash functions used. Probability of exceeding error limits is determined by the depth, exp(-depth). Default is 8 which should be fine for most circumstances.
max_count (int, optional) – Maximum value the count a given element may reach. Not used by ‘linear’ type. Default is 2**32 - 1 (4,294,967,295) to match linear type.
num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times an element is seen for counts <= num_reserved. Not used by the ‘linear’ type. Default is None which uses 15 for log8 and 1023 for log16. This must be less than the maximum uint value: 65,535 for log16 and 255 for log8.
shared_memory (bool, optional) – If True, then count-min sketch is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.
- Returns
cms – A count-min sketch of the requested type and size
- Return type
CountMinLinear | CountMinLog16 | CountMinLog8
-
class
sketchnu.countmin.
CountMinLinear
(width: int, depth: int = 8, shared_memory: bool = False)[source]¶ Bases:
object
Count-min sketch that uses 32-bit linear counters with conservative updating. A given element’s maximum count is 2**32 - 1
- Parameters
width (int) – Width of the count-min sketch. Must be non-negative
depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8
shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.
-
width
¶ Width of the 2-d array of counters of the count-min sketch
- Type
np.uint64
-
depth
¶ Depth of the 2-d array of counters of the count-min sketch
- Type
np.uint64
Whether cms and n_added_records are attached to a shared memory block
- Type
bool
-
cms
¶ 2-d array of the counters. Shape = (depth, width)
- Type
np.uint32[:,:]
-
n_added_records
¶ 1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.
- Type
np.uint64[:]
-
add
(key: bytes, value: int = 1) → None[source]¶ Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].
- Parameters
key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. value will be capped at 4,294,967,295 to prevent overflows. Default is 1
- Returns
- Return type
None
-
add_ngram
(key: bytes, ngram: int) → None[source]¶ Take a given key and shingle it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key
- Parameters
key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size
- Returns
- Return type
None
-
attach_existing_shm
(existing_shm_name: str) → None[source]¶ Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.
- Parameters
existing_shm_name (str) – Name of an existing shared memory block to attach this sketch to
- Returns
- Return type
None
-
static
load
(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶ Load a saved CountMinLinear stored in filename
- Parameters
filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.
- Returns
- Return type
-
merge
(other) → None[source]¶ Merge the count-min sketch other into this one.
- Parameters
other (CountMinLinear) – Another CountMinLinear with the same width and depth.
- Returns
- Return type
None
- Raises
TypeError – If other has different width, depth, or dtype
-
n_added
() → numpy.uint64[source]¶ This special counter is used to track the total number of elements that have been added to the sketch. Useful to check the error guarantees.
- Returns
The number of elements that have been added to the sketch.
- Return type
np.uint64
-
n_records
() → numpy.uint64[source]¶ This special counter is used by the sketchnu.helpers.parallel_add() to keep track of the number of records that have been added to the sketch. This can be used as the numerator of the idf piece of a tf-idf.
- Returns
The number of records that have been added to the sketch.
- Return type
np.uint64
-
query
(key: bytes) → int[source]¶ Return estimated number of times key was added into the count-min sketch
- Parameters
key (bytes) – Element whose estimated count you want returned
- Returns
- Return type
int
-
save
(filename: Union[str, pathlib.Path]) → None[source]¶ Save the sketch to filename
- Parameters
filename (str | Path) – File to save the sketch to disk. This will be a .npz file.
- Returns
- Return type
None
-
update
(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]¶ Add keys to the sketch. This follows the convention of collections.Counter
- Parameters
keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the sketch or a dictionary which can specify the number of times to add each key
- Returns
- Return type
None
-
update_ngram
(keys: List[bytes], ngram: int) → None[source]¶ Given a list of keys, split each into ngrams of size ngram, and then add them to the sketch.
Note
Current implementation loops through the keys in Python. Speed testing showed that converting the list to numba’s typed list was very costly. Maybe if that gets faster a pure numba implementation can be done.
- Parameters
keys (List[bytes]) – List of elements to be shingled before adding to the sketch.
ngram (int) – ngram size
- Returns
- Return type
None
-
class
sketchnu.countmin.
CountMinLog16
(width: int, depth: int = 8, max_count=4294967295, num_reserved=1023, shared_memory: bool = False)[source]¶ Bases:
sketchnu.countmin.CountMinLinear
Count-min sketch that uses 16-bit log counters with conservative updating.
- Parameters
width (int) – Width of the count-min sketch. Must be non-negative.
depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8.
max_count (int, optional) – The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).
num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 1023. This must be less than 65,535 (2**16 - 1).
shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.
-
width
¶ Width of the 2-d array of counters of the count-min sketch
- Type
np.uint64
-
depth
¶ Depth of the 2-d array of counters of the count-min sketch
- Type
np.uint64
-
max_count
¶ The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).
- Type
int, optional
-
num_reserved
¶ Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 1023. This must be less than 65,535 (2**16 - 1).
- Type
int, optional
Whether cms and n_added_records are attached to a shared memory block
- Type
bool
-
cms
¶ 2-d array of the counters. Shape = (depth, width)
- Type
np.uint32[:,:]
-
n_added_records
¶ 1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.
- Type
np.uint64[:]
-
base
¶ Calculated base for the log counters. Depends affected by max_count and num_reserved.
- Type
np.float64
-
add
(key: bytes, value: int = 1) → None[source]¶ Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].
- Parameters
key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. Default is 1
- Returns
- Return type
None
-
add_ngram
(key: bytes, ngram: int) → None[source]¶ Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key
- Parameters
key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size
- Returns
- Return type
None
-
static
load
(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶ Load a saved CountMinLog16 stored in filename
- Parameters
filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.
- Returns
- Return type
-
merge
(other) → None[source]¶ Merge the count-min sketch other into this one.
- Parameters
other (CountMinLog16) – Another CountMinLog16 with the same parameters.
- Returns
- Return type
None
- Raises
TypeError – If other has different parameters
-
class
sketchnu.countmin.
CountMinLog8
(width: int, depth: int = 8, max_count=4294967295, num_reserved=15, shared_memory: bool = False)[source]¶ Bases:
sketchnu.countmin.CountMinLog16
Count-min sketch that uses 8-bit log counters with conservative updating.
- Parameters
width (int) – Width of the count-min sketch. Must be non-negative.
depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8.
max_count (int, optional) – The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).
num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 15. This must be less than 255 (2**8 - 1).
shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.
-
width
¶ Width of the 2-d array of counters of the count-min sketch
- Type
np.uint64
-
depth
¶ Depth of the 2-d array of counters of the count-min sketch
- Type
np.uint64
-
max_count
¶ The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).
- Type
int, optional
-
num_reserved
¶ Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 15. This must be less than 255 (2**8 - 1).
- Type
int, optional
Whether cms and n_added_records are attached to a shared memory block
- Type
bool
-
cms
¶ 2-d array of the counters. Shape = (depth, width)
- Type
np.uint32[:,:]
-
n_added_records
¶ 1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.
- Type
np.uint64[:]
-
base
¶ Calculated base for the log counters. Depends affected by max_count and num_reserved.
- Type
np.float64
-
add
(key: bytes, value: int = 1) → None[source]¶ Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].
- Parameters
key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. Default is 1
- Returns
- Return type
None
-
add_ngram
(key: bytes, ngram: int) → None[source]¶ Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key
- Parameters
key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size
- Returns
- Return type
None
-
static
load
(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶ Load a saved CountMinLog8 stored in filename
- Parameters
filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.
- Returns
- Return type
-
merge
(other) → None[source]¶ Merge the count-min sketch other into this one.
- Parameters
other (CountMinLog8) – Another CountMinLog8 with the same parameters.
- Returns
- Return type
None
- Raises
TypeError – If other has different parameters
-
sketchnu.countmin.
load
(filename: Union[str, pathlib.Path], shared_memory: bool = False) → Union[sketchnu.countmin.CountMinLinear, sketchnu.countmin.CountMinLog16, sketchnu.countmin.CountMinLog8][source]¶ Load a saved count-min sketch stored in filename
- Parameters
filename (str | Path) – File path to the saved .npz file
shared_memory (bool) – If True, load into shared memory
- Returns
- Return type
CountMinLinear | CountMinLog16 | CountMinLog8
sketchnu.heavyhitters¶
Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.
Copyright (C) 2022 Matthew Hendrey
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Numba implementation of the topkapi algorithm from
A. Mandal, H. Jiang, A. Shrivastava, and V. Sarkar, “Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements”, Advances in Neural Information Processing Systems 31, (2018).
-
class
sketchnu.heavyhitters.
HeavyHitters
(width: int, depth: int = 4, max_key_len: int = 16, phi: float = None, shared_memory: bool = False)[source]¶ Bases:
object
Sketch implementation of the phi-heavy hitters algorithm which identifies all the keys in a data stream that are observed in at least phi fraction of the records. This assumes that keys have a fat-tailed distribution in the data stream. This is an implementation of the Topkapi algorithm.
The parameter phi must be greater than 1 / width of the sketch for the theoretical guarantees to be valid. The theoretical guarantees use a count-min sketch to estimate the frequency of any given key. For practical reasons, the paper suggests dropping the count-min sketch to save space. In this case you use the lhh_count as an estimate of the frequency of occurence for a given key. Note that lhh_count <= true count <= cms. So when we call the query function we are doing a more conservative estimate which corresponds to a higher phi. If you are feeling like you want to squeeze more out of the sketch you can provide a lower threshold (= phi * n_added()) when calling the query() function.
- Parameters
width (int) – Width of the heavy hitters sketch. Must be non-negative
depth (int, optional) – Depth of the heavy hitters sketch. Must be non-negative. Default is 4
max_key_len (int, optional) – Maximum number of bytes any given key may have. Must be less than 256. Default is 16
phi (float, optional) – When generating the candidate set of heavy hitters, only keys whose estimated frequency of occurrence (lhh_count) >= phi * n_added() will be added to the candidate set. Default of None is set to 1 / width.
shared_memory (bool, optional) – If True, then sketch is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False
-
width
¶ Width of the 2-d array of counters of the sketch
- Type
np.uint64
-
depth
¶ Depth of the 2-d array of counters of the sketch
- Type
np.uint64
-
max_key_len
¶ Maximum number of bytes any given key may have. Must be less than 256
- Type
np.uint64
-
phi
¶ When generating the candidate set of heavy hitters, only keys whose estimated frequency of occurrence (lhh_count) >= phi * n_added() will be added to the candidate set. Default of None is set to 1 / width.
- Type
np.float64
-
lhh
¶ Storing the keys associated with each bucket in the 2-d array of counters. Keys are stored as numpy arrays, as opposed to 2-d list of bytes, in order for numba to be able to process them. If a given key has fewer bytes than max_len_key, then right padded with 0s.
- Type
np.ndarray, shape=(depth, width, max_key_len), dtype=np.uint8
-
lhh_count
¶ Store the counts associated with keys stored in lhh.
- Type
np.ndarray, shape=(depth, width), dtype=np.uint32
-
key_lens
¶ The length of each of the keys stored in lhh
- Type
np.ndarray, shape=(depth, width), dtype=np.uint8
-
n_added_records
¶ 1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed.
- Type
np.ndarray, shape=(2,), dtype=np.uint64
-
add
(key: bytes, value: int = 1) → None[source]¶ Add a single key to the heavy hitters sketch and update the counter tracking total number of keys added to the sketch.
- Parameters
key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. Default is 1
- Returns
- Return type
None
-
add_ngram
(key: bytes, ngram: int) → None[source]¶ Take a given key and shingle it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key
- Parameters
key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size
- Returns
- Return type
None
-
attach_existing_shm
(existing_shm_name: str) → None[source]¶ Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.
- Parameters
existing_shm_name (str) – Name of an existing shared memory block to attach this sketch to
- Returns
- Return type
None
-
generate_candidate_set
(threshold: int = None) → None[source]¶ Generate a candidate set of heavy hitters. Only keys in lhh whose corresponding counts in lhh_count are greater the threshold are included. Contrary to the paper, we all the rows instead of just the first one. This seems like a small price to pay to not lose candidates due to hash collisions. The candidate set is a collections.Counter stored in self.candidate_set
- Parameters
threshold (int, optional) – If None (default), then uses threshold provided during instantiation
- Returns
- Return type
None
-
static
load
(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶ Load a saved HeavyHitters stored in filename
- Parameters
filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.
- Returns
- Return type
-
merge
(other) → None[source]¶ Merge the HeavyHitter sketch other into this one.
- Parameters
other (HeavyHitters) – Another HeavyHitters with the same width, depth, max_key_len.
- Returns
- Return type
None
- Raises
TypeError – If other has different width, depth, or max_key_len
-
n_added
() → numpy.uint64[source]¶ This special counter is used to track the total number of elements that have been added to the sketch. Useful to check the error guarantees.
- Returns
The number of elements that have been added to the sketch.
- Return type
np.uint64
-
n_records
() → numpy.uint64[source]¶ This special counter is used by the sketchnu.helpers.parallel_add() to keep track of the number of records that have been added to the sketch.
- Returns
The number of records that have been added to the sketch.
- Return type
np.uint64
-
query
(k: int, threshold: int = None) → List[Tuple[bytes, int]][source]¶ Return the top k heavy hitters. If new data has been added or if threshold is different from the last time a candidate set was generated, then this will generate a new candidate set before selecting the top k.
- Parameters
k (int) –
threshold (int, optional) – Only include keys from lhh whose lhh_counts >= threshold. Default is None which then sets threshold to self.phi * self.n_added()
- Returns
Sorted list of the (key, count) of the top k. Format is the same as collections.Counter().most_common().
- Return type
List[Tuple[bytes, int]]
-
save
(filename: Union[str, pathlib.Path]) → None[source]¶ Save the sketch to filename adding the .npz extension if not already part of filename
- Parameters
filename (str | Path) – File to save the sketch to disk. This will be a .npz file.
- Returns
- Return type
None
-
update
(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]¶ Add keys to the sketch. This follows the convention of collections.Counter
- Parameters
keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the sketch or a dictionary which can specify the number of times to add each key
- Returns
- Return type
None
sketchnu.helpers¶
Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.
Copyright (C) 2022 Matthew Hendrey
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Helper functions to aid in parallelizating the creation of sketches using Python’s
multiprocessing
.
Create a new sketch and attach it to the shared memory block. Sketch is created by passing sketch_args to either CountMin(), HeavyHitters() or HyperLogLog().
- Parameters
sketch_type (str) – Specify the type of sketch to create. Must be “cms” | “hh” | “hll”.
sketch_args (Dict) – The associated arguments to instantiante a new sketch.
shm_name (str) – Name of a shared memory block that already exist. The new sketch will attach to this block.
- Returns
local_sketch – A local sketch, relative to this process, that is attached to the shared memory block of an existing sketch that has shared_memory = True
- Return type
CountMinLinear | CountMinLog16 | CountMinLog8 | HeavyHitters | HyperLogLog
-
sketchnu.helpers.
parallel_add
(items: Iterable, process_q_item: Callable[[...], int], n_workers: int = None, cms_args: Dict = None, hh_args: Dict = None, hll_args: Dict = None, **kwargs)[source]¶ Places items onto a queue to be processed by n_workers independent spawned processes.
The user defined function,
process_q_item
, takes the arguments (q_item, *sketches, **kwargs). This function is given a single item from the queue and then should add elements to the sketch(es) as desired. The function should return the number of records processed. Ifprocess_q_item
will add elements to multiple sketches, then they must be listed in alphabetic order since that is howparallel_add
will pass them toprocess_q_item
.The **kwargs are passed along to
process_q_item
to allow for any needed additional parameters.Once all items have been processed, the n_workers sketch(s) are merged with the final result(s) returned.
You must provide at least cms_args | hh_args | hll_args. If you provide more than one, then the requested sketches will be processed at the same time while going over the data just once.
Note: If your data has duplicate keys within a item, you will likely see better performance if
process_q_item
aggregates the counts per key and then calls thesketch.add(key, count)
- Parameters
items (Iterable) – A generator or list of items that will be placed onto a queue and then worked by one of the workers in a separate spawned process.
process_q_item (Callable[.., int]) – User defined function whose arguments are (q_item, *sketches, **kwargs) that takes the q_item, adds elements to the sketch(es) and returns the number of records that were processed. *sketches must be in alphabetic order since that is how they will be passed by parallel_add.
n_workers (int, optional) – Number of workers to use. Each will update their own sketches which will then get merged together to achieve the final sketch(s). If None (default), then set to psutil.cpu_count(logical=False)
cms_args (Dict, optional) – Dictionary containing arguments to instantiate a CountMin. If None (default) then don’t create a sketch of this type.
hh_args (Dict, optional) – Dictionary containing arguments to instantiate a HeavyHitters. If None (default) then don’t create a sketch of this type.
hll_args (Dict, optional) – Dictionary containing arguments to instantiate a HyperLogLog. If None (default) then don’t create a sketch of this type.
**kwargs – Keyword arguments that get passed to
process_q_item
function
- Returns
The final sketch(s). If doing more than one sketch, then they are returned as a tuple in alphabetical order: cms, hh, hll
- Return type
sketches
-
sketchnu.helpers.
parallel_merging
(sketch_array: List, log_queue: multiprocessing.context.BaseContext.Queue)[source]¶ Merge an array of sketches in successive rounds of pairing. This will use at most len(sketch_array) // 2 processes. After merging, the final sketch is returned.
- Parameters
sketch_array (List) – Array containing sketches to be merged together
log_queue (Queue) – Log statements to a log Queue
- Returns
The final sketch resulting from merging all of the sketches in the sketch_array.
- Return type
CountMinLinear | CountMinLog16 | CountMinLog8 | HyperLogLog
sketchnu.hashes¶
Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.
Copyright (C) 2022 Matthew Hendrey
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Numba implementations of the non-cryptographic hashing functions FastHash (32 & 64-bit) and MurmurHash3 (32-bit). The 64-bit FastHash is used in both the HyperLogLog and count-min sketch code.
-
sketchnu.hashes.
fasthash32
[source]¶ Calculate the unsigned 32-bit integer FastHash value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/
- Parameters
key (bytes) – Bytes to be hashed
seed (uint64) – Seed to use when hashing
- Returns
Unsigned 32-bit integer hash value of key
- Return type
uint32
-
sketchnu.hashes.
fasthash64
[source]¶ Calculate the unsigned 64-bit integer FastHash value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/
- Parameters
key (bytes) – Bytes to be hashed
seed (uint64) – Seed to use when hashing
- Returns
Unsigned 64-bit integer hash value of key
- Return type
uint64
-
sketchnu.hashes.
murmur3
[source]¶ Calculate the unsigned 32-bit integer MurmurHash3 value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/
- Parameters
key (bytes) – Bytes to be hashed
seed (uint32) – Seed to use when hashing
- Returns
Unsigned 32-bit integer hash value of key
- Return type
uint32