sketchnu.hyperloglog¶

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Too many problems with the experimental Numba class. Converting this to a series of Numba functions and a Python class.

Example

To get a HyperLogLog with precision of 16 and seed 0:

from sketchnu.hyperloglog import HyperLogLog

hll = HyperLogLog()
key = 'testing'.encode('utf-8')
hll.add(key)
# To get a estimated cardinality
hll.query()

class sketchnu.hyperloglog.HyperLogLog(p: int = 16, seed: int = 0, shared_memory: bool = False)[source]¶

Bases: object

Implementation of the HyperLogLog++ algorithm which uses a 64-bit hash function and corrects bias when cardinality is low.

Parameters

p (int) – Precision specifies the number of registers (2**p). The larger the p, the more accurate the estimated cardinality. Must be between [7, 16]
seed (int, optional) – Seed passed to the fasthash64 function. Default is 0
shared_memory (bool, optional) – If True, then HyperLogLog is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

p¶

Precision of the HyperLogLog. Must be between [7, 16]

Type: int

seed¶

Seed passed to hash function

Type: int

m¶

Number of registers. Equal to 2**p

Type: int

shared_memory¶

If True, then HyperLogLog is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

Type: bool, optional

add(key: bytes, value: int = 1) → None[source]¶

Add a single key to the HyperLogLog.

Note

Currently, the numba.typed.List needed for the update() are quite slow. Speedtest show that you are better off looping through the keys in python and add(key) instead of update(numba.typed.List(keys)). Taking just over twice as long.

Parameters

key (bytes) – Element to add to the HyperLogLog
value (int, optional) – Number of times to add key to the sketch. Ignored for HyperLogLog but included in order to have the same API as the other sketches. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]¶

Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters

key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size

Returns

Return type

None

attach_existing_shm(existing_shm_name: str) → multiprocessing.shared_memory.SharedMemory[source]¶

Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.

Parameters: existing_shm_name (str) – Name an existing shared memory block to attach this sketch to
Returns
Return type: None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶

Load a saved HyperLogLog sketch stored in filename

Parameters

filename (str | Path) – File path to the saved .npz file
shared_memory (bool) – If True, copy registers into shared memory

Returns

Return type

HyperLogLog

merge(other) → None[source]¶

Merge the HyperLogLog other into this one.

Parameters: other (HyperLogLog) – Another HyperLogLog with the same precision and seed
Returns
Return type: None
Raises: TypeError – If other has different precision or seed from this one

query() → float[source]¶

Return the estimated cardinality

Returns: Estimated cardinality of elements added to the HyperLogLog so far
Return type: float

save(filename: Union[str, pathlib.Path]) → None[source]¶

Save the HyperLogLog sketch, hll, to the file, filename

Parameters: filename (str | Path) – File to save the hll to disk. This will be a .npz file.
Returns
Return type: None

update(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]¶

Given a list of keys, update the HyperLogLog. This follows the convention of collections.Counter in that keys should be a list of keys or a dictionary whose keys are the keys. For this sketch, the value of the dictionary is ignored since it doesn’t make sense to put the same key in multiple times.

Note

Currently, the numba.typed.List needed for the update() are quite slow. Speedtest show that you are better off looping through the keys in python and add(key) instead of update(numba.typed.List(keys)). Taking just over twice as long. If repeated adding the same keys, then update is faster.

Parameters: keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the HyperLogLog at once. If a Dict is passed, then only the Dict.keys() are put into the sketch.
Returns
Return type: None

update_ngram(keys: bytes, ngram: int) → None[source]¶

Given a list of keys, split each into ngrams of size ngram, and then add them to the sketch.

Note

Speed tests show that you are slightly better off looping through the keys in Python and calling add_ngram() instead of update_ngram(). Used 50k keys, each of 200 bytes, and ngram=4 to find that update_ngram() was ~ 6% slower. This is significantly better than add() vs. update(). If repeated adding the same keys (for testing), then update_ngram() is just a little bit faster than add_ngram().

Parameters

keys (List[bytes]) – List of elements to be shingled before adding to the sketch.
ngram (int) – ngram size

Returns

Return type

None

sketchnu.countmin¶

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Too many problems with the experimental Numba class. Converting this to a series of Numba functions and a Python classes.

Implementation of count-min sketch using Numba. Three different types have been coded, but all use conservative updating to help reduce errors. The three types are

linear : Uses 32-bit linear counters
log16 : Uses 16-bit log counters
log8 : Uses 8-bit log counters

Each type is implemented as a Python class, but uses numba functions under the hood where possible (and faster). The convenience function, CountMin() is the recommended way to instantiate a count-min sketch. The Python classes are CountMinLinear, CountMinLog16, and CountMinLog8.

Example

To get a linear CountMin with width = 2**20

from sketchnu.countmin import CountMin

width = 2**20
cms = CountMin("linear", width)
key = "testing".encode("utf-8")
cms.add(key)
# To get an estimated count for that key use either
cms.query(key)
cms[key]

Each count-min sketch implementation has two special counters that are stored in the attribute n_added_records which is a 1-d array with two elements. The first records the total number of elements added to the count-min sketch. This is useful when calculating the error guarantees. Its value can be retrieved with n_added(). The second element is used by helpers.parallel_add() to store the number of records added to the sketch. This value is needed when using a count-min sketch to calculate the tf-idf value for a given record. The idf piece needs the number of records which have a given key, i.e. document frequency. This can be provided directly by the count-min sketch. But you also need to know the total number of records. This value can be retrieved by n_records() if you use helpers.parallel_add() to create the count-min sketch.

sketchnu.countmin.CountMin(cms_type: str, width: int, depth: int = 8, max_count: int = 4294967295, num_reserved: int = None, shared_memory: bool = False) → Union[sketchnu.countmin.CountMinLinear, sketchnu.countmin.CountMinLog16, sketchnu.countmin.CountMinLog8][source]¶

Convenience function to instantiate a count-min sketch of the given type.

Parameters

cms_type (str) – Must be ‘linear’ | ‘log16’ | ‘log8’
width (int) – Width of the count-min sketch. Best if you keep width >= n_unique/2
depth (int, optional) – Depth of the count-min sketch. Sets the number of different hash functions used. Probability of exceeding error limits is determined by the depth, exp(-depth). Default is 8 which should be fine for most circumstances.
max_count (int, optional) – Maximum value the count a given element may reach. Not used by ‘linear’ type. Default is 2**32 - 1 (4,294,967,295) to match linear type.
num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times an element is seen for counts <= num_reserved. Not used by the ‘linear’ type. Default is None which uses 15 for log8 and 1023 for log16. This must be less than the maximum uint value: 65,535 for log16 and 255 for log8.
shared_memory (bool, optional) – If True, then count-min sketch is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

Returns

cms – A count-min sketch of the requested type and size

Return type

CountMinLinear | CountMinLog16 | CountMinLog8

class sketchnu.countmin.CountMinLinear(width: int, depth: int = 8, shared_memory: bool = False)[source]¶

Bases: object

Count-min sketch that uses 32-bit linear counters with conservative updating. A given element’s maximum count is 2**32 - 1

Parameters

width (int) – Width of the count-min sketch. Must be non-negative
depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8
shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

width¶

Width of the 2-d array of counters of the count-min sketch

Type: np.uint64

depth¶

Depth of the 2-d array of counters of the count-min sketch

Type: np.uint64

shared_memory¶

Whether cms and n_added_records are attached to a shared memory block

Type: bool

cms¶

2-d array of the counters. Shape = (depth, width)

Type: np.uint32[:,:]

n_added_records¶

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.

Type: np.uint64[:]

add(key: bytes, value: int = 1) → None[source]¶

Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].

Parameters

key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. value will be capped at 4,294,967,295 to prevent overflows. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]¶

Take a given key and shingle it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters

key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size

Returns

Return type

None

attach_existing_shm(existing_shm_name: str) → None[source]¶

Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.

Parameters: existing_shm_name (str) – Name of an existing shared memory block to attach this sketch to
Returns
Return type: None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶

Load a saved CountMinLinear stored in filename

Parameters

filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

CountMinLinear

merge(other) → None[source]¶

Merge the count-min sketch other into this one.

Parameters: other (CountMinLinear) – Another CountMinLinear with the same width and depth.
Returns
Return type: None
Raises: TypeError – If other has different width, depth, or dtype

n_added() → numpy.uint64[source]¶

This special counter is used to track the total number of elements that have been added to the sketch. Useful to check the error guarantees.

Returns: The number of elements that have been added to the sketch.
Return type: np.uint64

n_records() → numpy.uint64[source]¶

This special counter is used by the sketchnu.helpers.parallel_add() to keep track of the number of records that have been added to the sketch. This can be used as the numerator of the idf piece of a tf-idf.

Returns: The number of records that have been added to the sketch.
Return type: np.uint64

query(key: bytes) → int[source]¶

Return estimated number of times key was added into the count-min sketch

Parameters: key (bytes) – Element whose estimated count you want returned
Returns
Return type: int

save(filename: Union[str, pathlib.Path]) → None[source]¶

Save the sketch to filename

Parameters: filename (str | Path) – File to save the sketch to disk. This will be a .npz file.
Returns
Return type: None

update(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]¶

Add keys to the sketch. This follows the convention of collections.Counter

Parameters: keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the sketch or a dictionary which can specify the number of times to add each key
Returns
Return type: None

update_ngram(keys: List[bytes], ngram: int) → None[source]¶

Given a list of keys, split each into ngrams of size ngram, and then add them to the sketch.

Note

Current implementation loops through the keys in Python. Speed testing showed that converting the list to numba’s typed list was very costly. Maybe if that gets faster a pure numba implementation can be done.

Parameters

keys (List[bytes]) – List of elements to be shingled before adding to the sketch.
ngram (int) – ngram size

Returns

Return type

None

class sketchnu.countmin.CountMinLog16(width: int, depth: int = 8, max_count=4294967295, num_reserved=1023, shared_memory: bool = False)[source]¶

Bases: sketchnu.countmin.CountMinLinear

Count-min sketch that uses 16-bit log counters with conservative updating.

Parameters

width (int) – Width of the count-min sketch. Must be non-negative.
depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8.
max_count (int, optional) – The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).
num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 1023. This must be less than 65,535 (2**16 - 1).
shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

width¶

Width of the 2-d array of counters of the count-min sketch

Type: np.uint64

depth¶

Depth of the 2-d array of counters of the count-min sketch

Type: np.uint64

max_count¶

The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).

Type: int, optional

num_reserved¶

Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 1023. This must be less than 65,535 (2**16 - 1).

Type: int, optional

shared_memory¶

Whether cms and n_added_records are attached to a shared memory block

Type: bool

cms¶

2-d array of the counters. Shape = (depth, width)

Type: np.uint32[:,:]

n_added_records¶

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.

Type: np.uint64[:]

base¶

Calculated base for the log counters. Depends affected by max_count and num_reserved.

Type: np.float64

add(key: bytes, value: int = 1) → None[source]¶

Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].

Parameters

key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]¶

Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters

key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶

Load a saved CountMinLog16 stored in filename

Parameters

filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

CountMinLog16

merge(other) → None[source]¶

Merge the count-min sketch other into this one.

Parameters: other (CountMinLog16) – Another CountMinLog16 with the same parameters.
Returns
Return type: None
Raises: TypeError – If other has different parameters

query(key: bytes) → float[source]¶

Return estimated number of times key was added into the count-min sketch

Parameters: key (bytes) – Element whose estimated count you want returned
Returns
Return type: float

save(filename: Union[str, pathlib.Path]) → None[source]¶

Save the count-min sketch to filename

Parameters: filename (str | Path) – File to save the hll to disk. This will be a .npz file.
Returns
Return type: None

class sketchnu.countmin.CountMinLog8(width: int, depth: int = 8, max_count=4294967295, num_reserved=15, shared_memory: bool = False)[source]¶

Bases: sketchnu.countmin.CountMinLog16

Count-min sketch that uses 8-bit log counters with conservative updating.

Parameters

width (int) – Width of the count-min sketch. Must be non-negative.
depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8.
max_count (int, optional) – The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).
num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 15. This must be less than 255 (2**8 - 1).
shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

width¶

Width of the 2-d array of counters of the count-min sketch

Type: np.uint64

depth¶

Depth of the 2-d array of counters of the count-min sketch

Type: np.uint64

max_count¶

The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).

Type: int, optional

num_reserved¶

Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 15. This must be less than 255 (2**8 - 1).

Type: int, optional

shared_memory¶

Whether cms and n_added_records are attached to a shared memory block

Type: bool

cms¶

2-d array of the counters. Shape = (depth, width)

Type: np.uint32[:,:]

n_added_records¶

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.

Type: np.uint64[:]

base¶

Calculated base for the log counters. Depends affected by max_count and num_reserved.

Type: np.float64

add(key: bytes, value: int = 1) → None[source]¶

Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].

Parameters

key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]¶

Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters

key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶

Load a saved CountMinLog8 stored in filename

Parameters

filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

CountMinLog8

merge(other) → None[source]¶

Merge the count-min sketch other into this one.

Parameters: other (CountMinLog8) – Another CountMinLog8 with the same parameters.
Returns
Return type: None
Raises: TypeError – If other has different parameters

query(key: bytes) → float[source]¶

Return estimated number of times key was added into the count-min sketch

Parameters: key (bytes) – Element whose estimated count you want returned
Returns
Return type: float

sketchnu.countmin.load(filename: Union[str, pathlib.Path], shared_memory: bool = False) → Union[sketchnu.countmin.CountMinLinear, sketchnu.countmin.CountMinLog16, sketchnu.countmin.CountMinLog8][source]¶

Load a saved count-min sketch stored in filename

Parameters

filename (str | Path) – File path to the saved .npz file
shared_memory (bool) – If True, load into shared memory

Returns

Return type

CountMinLinear | CountMinLog16 | CountMinLog8

sketchnu.heavyhitters¶

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Numba implementation of the topkapi algorithm from

A. Mandal, H. Jiang, A. Shrivastava, and V. Sarkar, “Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements”, Advances in Neural Information Processing Systems 31, (2018).

class sketchnu.heavyhitters.HeavyHitters(width: int, depth: int = 4, max_key_len: int = 16, phi: float = None, shared_memory: bool = False)[source]¶

Bases: object

Sketch implementation of the phi-heavy hitters algorithm which identifies all the keys in a data stream that are observed in at least phi fraction of the records. This assumes that keys have a fat-tailed distribution in the data stream. This is an implementation of the Topkapi algorithm.

The parameter phi must be greater than 1 / width of the sketch for the theoretical guarantees to be valid. The theoretical guarantees use a count-min sketch to estimate the frequency of any given key. For practical reasons, the paper suggests dropping the count-min sketch to save space. In this case you use the lhh_count as an estimate of the frequency of occurence for a given key. Note that lhh_count <= true count <= cms. So when we call the query function we are doing a more conservative estimate which corresponds to a higher phi. If you are feeling like you want to squeeze more out of the sketch you can provide a lower threshold (= phi * n_added()) when calling the query() function.

Parameters

width (int) – Width of the heavy hitters sketch. Must be non-negative
depth (int, optional) – Depth of the heavy hitters sketch. Must be non-negative. Default is 4
max_key_len (int, optional) – Maximum number of bytes any given key may have. Must be less than 256. Default is 16
phi (float, optional) – When generating the candidate set of heavy hitters, only keys whose estimated frequency of occurrence (lhh_count) >= phi * n_added() will be added to the candidate set. Default of None is set to 1 / width.
shared_memory (bool, optional) – If True, then sketch is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False

width¶

Width of the 2-d array of counters of the sketch

Type: np.uint64

depth¶

Depth of the 2-d array of counters of the sketch

Type: np.uint64

max_key_len¶

Maximum number of bytes any given key may have. Must be less than 256

Type: np.uint64

phi¶

When generating the candidate set of heavy hitters, only keys whose estimated frequency of occurrence (lhh_count) >= phi * n_added() will be added to the candidate set. Default of None is set to 1 / width.

Type: np.float64

lhh¶

Storing the keys associated with each bucket in the 2-d array of counters. Keys are stored as numpy arrays, as opposed to 2-d list of bytes, in order for numba to be able to process them. If a given key has fewer bytes than max_len_key, then right padded with 0s.

Type: np.ndarray, shape=(depth, width, max_key_len), dtype=np.uint8

lhh_count¶

Store the counts associated with keys stored in lhh.

Type: np.ndarray, shape=(depth, width), dtype=np.uint32

key_lens¶

The length of each of the keys stored in lhh

Type: np.ndarray, shape=(depth, width), dtype=np.uint8

n_added_records¶

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed.

Type: np.ndarray, shape=(2,), dtype=np.uint64

add(key: bytes, value: int = 1) → None[source]¶

Add a single key to the heavy hitters sketch and update the counter tracking total number of keys added to the sketch.

Parameters

key (bytes) – Element to be added to the sketch
value (int, optional) – Number of times to add key to the sketch. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]¶

Take a given key and shingle it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters

key (bytes) – Element to be shingled before adding to the sketch
ngram (int) – ngram size

Returns

Return type

None

attach_existing_shm(existing_shm_name: str) → None[source]¶

Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.

Parameters: existing_shm_name (str) – Name of an existing shared memory block to attach this sketch to
Returns
Return type: None

generate_candidate_set(threshold: int = None) → None[source]¶

Generate a candidate set of heavy hitters. Only keys in lhh whose corresponding counts in lhh_count are greater the threshold are included. Contrary to the paper, we all the rows instead of just the first one. This seems like a small price to pay to not lose candidates due to hash collisions. The candidate set is a collections.Counter stored in self.candidate_set

Parameters: threshold (int, optional) – If None (default), then uses threshold provided during instantiation
Returns
Return type: None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]¶

Load a saved HeavyHitters stored in filename

Parameters

filename (str | Path) – File path to the saved .npz file
shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

HeavyHitters

merge(other) → None[source]¶

Merge the HeavyHitter sketch other into this one.

Parameters: other (HeavyHitters) – Another HeavyHitters with the same width, depth, max_key_len.
Returns
Return type: None
Raises: TypeError – If other has different width, depth, or max_key_len

n_added() → numpy.uint64[source]¶

This special counter is used to track the total number of elements that have been added to the sketch. Useful to check the error guarantees.

Returns: The number of elements that have been added to the sketch.
Return type: np.uint64

n_records() → numpy.uint64[source]¶

This special counter is used by the sketchnu.helpers.parallel_add() to keep track of the number of records that have been added to the sketch.

Returns: The number of records that have been added to the sketch.
Return type: np.uint64

query(k: int, threshold: int = None) → List[Tuple[bytes, int]][source]¶

Return the top k heavy hitters. If new data has been added or if threshold is different from the last time a candidate set was generated, then this will generate a new candidate set before selecting the top k.

Parameters

k (int) –
threshold (int, optional) – Only include keys from lhh whose lhh_counts >= threshold. Default is None which then sets threshold to self.phi * self.n_added()

Returns

Sorted list of the (key, count) of the top k. Format is the same as collections.Counter().most_common().

Return type

List[Tuple[bytes, int]]

save(filename: Union[str, pathlib.Path]) → None[source]¶

Save the sketch to filename adding the .npz extension if not already part of filename

Parameters: filename (str | Path) – File to save the sketch to disk. This will be a .npz file.
Returns
Return type: None

update(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]¶

Add keys to the sketch. This follows the convention of collections.Counter

Parameters: keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the sketch or a dictionary which can specify the number of times to add each key
Returns
Return type: None

update_ngram(keys: List[bytes], ngram: int) → None[source]¶

Given a list of keys, shingle each into ngrams of size ngram, and then add them to the sketch.

Parameters

keys (List[bytes]) – List of elements to be shingled before adding to the sketch.
ngram (int) – ngram size

Returns

Return type

None

sketchnu.helpers¶

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Helper functions to aid in parallelizating the creation of sketches using Python’s multiprocessing.

sketchnu.helpers.attach_shared_memory(sketch_type: str, sketch_args: Dict, shm_name: str)[source]¶

Create a new sketch and attach it to the shared memory block. Sketch is created by passing sketch_args to either CountMin(), HeavyHitters() or HyperLogLog().

Parameters

sketch_type (str) – Specify the type of sketch to create. Must be “cms” | “hh” | “hll”.
sketch_args (Dict) – The associated arguments to instantiante a new sketch.
shm_name (str) – Name of a shared memory block that already exist. The new sketch will attach to this block.

Returns

local_sketch – A local sketch, relative to this process, that is attached to the shared memory block of an existing sketch that has shared_memory = True

Return type

CountMinLinear | CountMinLog16 | CountMinLog8 | HeavyHitters | HyperLogLog

sketchnu.helpers.parallel_add(items: Iterable, process_q_item: Callable[[...], int], n_workers: int = None, cms_args: Dict = None, hh_args: Dict = None, hll_args: Dict = None, **kwargs)[source]¶

Places items onto a queue to be processed by n_workers independent spawned processes.

The user defined function, process_q_item, takes the arguments (q_item, *sketches, **kwargs). This function is given a single item from the queue and then should add elements to the sketch(es) as desired. The function should return the number of records processed. If process_q_item will add elements to multiple sketches, then they must be listed in alphabetic order since that is how parallel_add will pass them to process_q_item.

The **kwargs are passed along to process_q_item to allow for any needed additional parameters.

Once all items have been processed, the n_workers sketch(s) are merged with the final result(s) returned.

You must provide at least cms_args | hh_args | hll_args. If you provide more than one, then the requested sketches will be processed at the same time while going over the data just once.

Note: If your data has duplicate keys within a item, you will likely see better performance if process_q_item aggregates the counts per key and then calls the sketch.add(key, count)

Parameters

items (Iterable) – A generator or list of items that will be placed onto a queue and then worked by one of the workers in a separate spawned process.
process_q_item (Callable[.., int]) – User defined function whose arguments are (q_item, *sketches, **kwargs) that takes the q_item, adds elements to the sketch(es) and returns the number of records that were processed. *sketches must be in alphabetic order since that is how they will be passed by parallel_add.
n_workers (int, optional) – Number of workers to use. Each will update their own sketches which will then get merged together to achieve the final sketch(s). If None (default), then set to psutil.cpu_count(logical=False)
cms_args (Dict, optional) – Dictionary containing arguments to instantiate a CountMin. If None (default) then don’t create a sketch of this type.
hh_args (Dict, optional) – Dictionary containing arguments to instantiate a HeavyHitters. If None (default) then don’t create a sketch of this type.
hll_args (Dict, optional) – Dictionary containing arguments to instantiate a HyperLogLog. If None (default) then don’t create a sketch of this type.
**kwargs – Keyword arguments that get passed to process_q_item function

Returns

The final sketch(s). If doing more than one sketch, then they are returned as a tuple in alphabetical order: cms, hh, hll

Return type

sketches

sketchnu.helpers.parallel_merging(sketch_array: List, log_queue: multiprocessing.context.BaseContext.Queue)[source]¶

Merge an array of sketches in successive rounds of pairing. This will use at most len(sketch_array) // 2 processes. After merging, the final sketch is returned.

Parameters

sketch_array (List) – Array containing sketches to be merged together
log_queue (Queue) – Log statements to a log Queue

Returns

The final sketch resulting from merging all of the sketches in the sketch_array.

Return type

CountMinLinear | CountMinLog16 | CountMinLog8 | HyperLogLog

sketchnu.hashes¶

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Numba implementations of the non-cryptographic hashing functions FastHash (32 & 64-bit) and MurmurHash3 (32-bit). The 64-bit FastHash is used in both the HyperLogLog and count-min sketch code.

sketchnu.hashes.fasthash32[source]¶

Calculate the unsigned 32-bit integer FastHash value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/

Parameters

key (bytes) – Bytes to be hashed
seed (uint64) – Seed to use when hashing

Returns

Unsigned 32-bit integer hash value of key

Return type

uint32

sketchnu.hashes.fasthash64[source]¶

Calculate the unsigned 64-bit integer FastHash value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/

Parameters

key (bytes) – Bytes to be hashed
seed (uint64) – Seed to use when hashing

Returns

Unsigned 64-bit integer hash value of key

Return type

uint64

sketchnu.hashes.murmur3[source]¶

Calculate the unsigned 32-bit integer MurmurHash3 value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/

Parameters

key (bytes) – Bytes to be hashed
seed (uint32) – Seed to use when hashing

Returns

Unsigned 32-bit integer hash value of key

Return type

uint32

sketchnu.hyperloglog¶

sketchnu.countmin¶

sketchnu.heavyhitters¶

sketchnu.helpers¶

sketchnu.hashes¶

sketchnu.tests¶