sketchnu.hyperloglog

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

Copyright (C) 2022 Matthew Hendrey

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Too many problems with the experimental Numba class. Converting this to a series of Numba functions and a Python class.

Example

To get a HyperLogLog with precision of 16 and seed 0:

from sketchnu.hyperloglog import HyperLogLog

hll = HyperLogLog()
key = 'testing'.encode('utf-8')
hll.add(key)
# To get a estimated cardinality
hll.query()
class sketchnu.hyperloglog.HyperLogLog(p: int = 16, seed: int = 0, shared_memory: bool = False)[source]

Bases: object

Implementation of the HyperLogLog++ algorithm which uses a 64-bit hash function and corrects bias when cardinality is low.

Parameters
  • p (int) – Precision specifies the number of registers (2**p). The larger the p, the more accurate the estimated cardinality. Must be between [7, 16]

  • seed (int, optional) – Seed passed to the fasthash64 function. Default is 0

  • shared_memory (bool, optional) – If True, then HyperLogLog is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

p

Precision of the HyperLogLog. Must be between [7, 16]

Type

int

seed

Seed passed to hash function

Type

int

m

Number of registers. Equal to 2**p

Type

int

shared_memory

If True, then HyperLogLog is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

Type

bool, optional

add(key: bytes, value: int = 1) → None[source]

Add a single key to the HyperLogLog.

Note

Currently, the numba.typed.List needed for the update() are quite slow. Speedtest show that you are better off looping through the keys in python and add(key) instead of update(numba.typed.List(keys)). Taking just over twice as long.

Parameters
  • key (bytes) – Element to add to the HyperLogLog

  • value (int, optional) – Number of times to add key to the sketch. Ignored for HyperLogLog but included in order to have the same API as the other sketches. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]

Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters
  • key (bytes) – Element to be shingled before adding to the sketch

  • ngram (int) – ngram size

Returns

Return type

None

attach_existing_shm(existing_shm_name: str) → multiprocessing.shared_memory.SharedMemory[source]

Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.

Parameters

existing_shm_name (str) – Name an existing shared memory block to attach this sketch to

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]

Load a saved HyperLogLog sketch stored in filename

Parameters
  • filename (str | Path) – File path to the saved .npz file

  • shared_memory (bool) – If True, copy registers into shared memory

Returns

Return type

HyperLogLog

merge(other) → None[source]

Merge the HyperLogLog other into this one.

Parameters

other (HyperLogLog) – Another HyperLogLog with the same precision and seed

Returns

Return type

None

Raises

TypeError – If other has different precision or seed from this one

query() → float[source]

Return the estimated cardinality

Returns

Estimated cardinality of elements added to the HyperLogLog so far

Return type

float

save(filename: Union[str, pathlib.Path]) → None[source]

Save the HyperLogLog sketch, hll, to the file, filename

Parameters

filename (str | Path) – File to save the hll to disk. This will be a .npz file.

Returns

Return type

None

update(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]

Given a list of keys, update the HyperLogLog. This follows the convention of collections.Counter in that keys should be a list of keys or a dictionary whose keys are the keys. For this sketch, the value of the dictionary is ignored since it doesn’t make sense to put the same key in multiple times.

Note

Currently, the numba.typed.List needed for the update() are quite slow. Speedtest show that you are better off looping through the keys in python and add(key) instead of update(numba.typed.List(keys)). Taking just over twice as long. If repeated adding the same keys, then update is faster.

Parameters

keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the HyperLogLog at once. If a Dict is passed, then only the Dict.keys() are put into the sketch.

Returns

Return type

None

update_ngram(keys: bytes, ngram: int) → None[source]

Given a list of keys, split each into ngrams of size ngram, and then add them to the sketch.

Note

Speed tests show that you are slightly better off looping through the keys in Python and calling add_ngram() instead of update_ngram(). Used 50k keys, each of 200 bytes, and ngram=4 to find that update_ngram() was ~ 6% slower. This is significantly better than add() vs. update(). If repeated adding the same keys (for testing), then update_ngram() is just a little bit faster than add_ngram().

Parameters
  • keys (List[bytes]) – List of elements to be shingled before adding to the sketch.

  • ngram (int) – ngram size

Returns

Return type

None

sketchnu.countmin

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

Copyright (C) 2022 Matthew Hendrey

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Too many problems with the experimental Numba class. Converting this to a series of Numba functions and a Python classes.

Implementation of count-min sketch using Numba. Three different types have been coded, but all use conservative updating to help reduce errors. The three types are

  • linear : Uses 32-bit linear counters

  • log16 : Uses 16-bit log counters

  • log8 : Uses 8-bit log counters

Each type is implemented as a Python class, but uses numba functions under the hood where possible (and faster). The convenience function, CountMin() is the recommended way to instantiate a count-min sketch. The Python classes are CountMinLinear, CountMinLog16, and CountMinLog8.

Example

To get a linear CountMin with width = 2**20

from sketchnu.countmin import CountMin

width = 2**20
cms = CountMin("linear", width)
key = "testing".encode("utf-8")
cms.add(key)
# To get an estimated count for that key use either
cms.query(key)
cms[key]

Each count-min sketch implementation has two special counters that are stored in the attribute n_added_records which is a 1-d array with two elements. The first records the total number of elements added to the count-min sketch. This is useful when calculating the error guarantees. Its value can be retrieved with n_added(). The second element is used by helpers.parallel_add() to store the number of records added to the sketch. This value is needed when using a count-min sketch to calculate the tf-idf value for a given record. The idf piece needs the number of records which have a given key, i.e. document frequency. This can be provided directly by the count-min sketch. But you also need to know the total number of records. This value can be retrieved by n_records() if you use helpers.parallel_add() to create the count-min sketch.

sketchnu.countmin.CountMin(cms_type: str, width: int, depth: int = 8, max_count: int = 4294967295, num_reserved: int = None, shared_memory: bool = False) → Union[sketchnu.countmin.CountMinLinear, sketchnu.countmin.CountMinLog16, sketchnu.countmin.CountMinLog8][source]

Convenience function to instantiate a count-min sketch of the given type.

Parameters
  • cms_type (str) – Must be ‘linear’ | ‘log16’ | ‘log8’

  • width (int) – Width of the count-min sketch. Best if you keep width >= n_unique/2

  • depth (int, optional) – Depth of the count-min sketch. Sets the number of different hash functions used. Probability of exceeding error limits is determined by the depth, exp(-depth). Default is 8 which should be fine for most circumstances.

  • max_count (int, optional) – Maximum value the count a given element may reach. Not used by ‘linear’ type. Default is 2**32 - 1 (4,294,967,295) to match linear type.

  • num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times an element is seen for counts <= num_reserved. Not used by the ‘linear’ type. Default is None which uses 15 for log8 and 1023 for log16. This must be less than the maximum uint value: 65,535 for log16 and 255 for log8.

  • shared_memory (bool, optional) – If True, then count-min sketch is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

Returns

cms – A count-min sketch of the requested type and size

Return type

CountMinLinear | CountMinLog16 | CountMinLog8

class sketchnu.countmin.CountMinLinear(width: int, depth: int = 8, shared_memory: bool = False)[source]

Bases: object

Count-min sketch that uses 32-bit linear counters with conservative updating. A given element’s maximum count is 2**32 - 1

Parameters
  • width (int) – Width of the count-min sketch. Must be non-negative

  • depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8

  • shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

width

Width of the 2-d array of counters of the count-min sketch

Type

np.uint64

depth

Depth of the 2-d array of counters of the count-min sketch

Type

np.uint64

shared_memory

Whether cms and n_added_records are attached to a shared memory block

Type

bool

cms

2-d array of the counters. Shape = (depth, width)

Type

np.uint32[:,:]

n_added_records

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.

Type

np.uint64[:]

add(key: bytes, value: int = 1) → None[source]

Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].

Parameters
  • key (bytes) – Element to be added to the sketch

  • value (int, optional) – Number of times to add key to the sketch. value will be capped at 4,294,967,295 to prevent overflows. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]

Take a given key and shingle it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters
  • key (bytes) – Element to be shingled before adding to the sketch

  • ngram (int) – ngram size

Returns

Return type

None

attach_existing_shm(existing_shm_name: str) → None[source]

Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.

Parameters

existing_shm_name (str) – Name of an existing shared memory block to attach this sketch to

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]

Load a saved CountMinLinear stored in filename

Parameters
  • filename (str | Path) – File path to the saved .npz file

  • shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

CountMinLinear

merge(other) → None[source]

Merge the count-min sketch other into this one.

Parameters

other (CountMinLinear) – Another CountMinLinear with the same width and depth.

Returns

Return type

None

Raises

TypeError – If other has different width, depth, or dtype

n_added() → numpy.uint64[source]

This special counter is used to track the total number of elements that have been added to the sketch. Useful to check the error guarantees.

Returns

The number of elements that have been added to the sketch.

Return type

np.uint64

n_records() → numpy.uint64[source]

This special counter is used by the sketchnu.helpers.parallel_add() to keep track of the number of records that have been added to the sketch. This can be used as the numerator of the idf piece of a tf-idf.

Returns

The number of records that have been added to the sketch.

Return type

np.uint64

query(key: bytes) → int[source]

Return estimated number of times key was added into the count-min sketch

Parameters

key (bytes) – Element whose estimated count you want returned

Returns

Return type

int

save(filename: Union[str, pathlib.Path]) → None[source]

Save the sketch to filename

Parameters

filename (str | Path) – File to save the sketch to disk. This will be a .npz file.

Returns

Return type

None

update(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]

Add keys to the sketch. This follows the convention of collections.Counter

Parameters

keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the sketch or a dictionary which can specify the number of times to add each key

Returns

Return type

None

update_ngram(keys: List[bytes], ngram: int) → None[source]

Given a list of keys, split each into ngrams of size ngram, and then add them to the sketch.

Note

Current implementation loops through the keys in Python. Speed testing showed that converting the list to numba’s typed list was very costly. Maybe if that gets faster a pure numba implementation can be done.

Parameters
  • keys (List[bytes]) – List of elements to be shingled before adding to the sketch.

  • ngram (int) – ngram size

Returns

Return type

None

class sketchnu.countmin.CountMinLog16(width: int, depth: int = 8, max_count=4294967295, num_reserved=1023, shared_memory: bool = False)[source]

Bases: sketchnu.countmin.CountMinLinear

Count-min sketch that uses 16-bit log counters with conservative updating.

Parameters
  • width (int) – Width of the count-min sketch. Must be non-negative.

  • depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8.

  • max_count (int, optional) – The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).

  • num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 1023. This must be less than 65,535 (2**16 - 1).

  • shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

width

Width of the 2-d array of counters of the count-min sketch

Type

np.uint64

depth

Depth of the 2-d array of counters of the count-min sketch

Type

np.uint64

max_count

The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).

Type

int, optional

num_reserved

Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 1023. This must be less than 65,535 (2**16 - 1).

Type

int, optional

shared_memory

Whether cms and n_added_records are attached to a shared memory block

Type

bool

cms

2-d array of the counters. Shape = (depth, width)

Type

np.uint32[:,:]

n_added_records

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.

Type

np.uint64[:]

base

Calculated base for the log counters. Depends affected by max_count and num_reserved.

Type

np.float64

add(key: bytes, value: int = 1) → None[source]

Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].

Parameters
  • key (bytes) – Element to be added to the sketch

  • value (int, optional) – Number of times to add key to the sketch. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]

Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters
  • key (bytes) – Element to be shingled before adding to the sketch

  • ngram (int) – ngram size

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]

Load a saved CountMinLog16 stored in filename

Parameters
  • filename (str | Path) – File path to the saved .npz file

  • shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

CountMinLog16

merge(other) → None[source]

Merge the count-min sketch other into this one.

Parameters

other (CountMinLog16) – Another CountMinLog16 with the same parameters.

Returns

Return type

None

Raises

TypeError – If other has different parameters

query(key: bytes) → float[source]

Return estimated number of times key was added into the count-min sketch

Parameters

key (bytes) – Element whose estimated count you want returned

Returns

Return type

float

save(filename: Union[str, pathlib.Path]) → None[source]

Save the count-min sketch to filename

Parameters

filename (str | Path) – File to save the hll to disk. This will be a .npz file.

Returns

Return type

None

class sketchnu.countmin.CountMinLog8(width: int, depth: int = 8, max_count=4294967295, num_reserved=15, shared_memory: bool = False)[source]

Bases: sketchnu.countmin.CountMinLog16

Count-min sketch that uses 8-bit log counters with conservative updating.

Parameters
  • width (int) – Width of the count-min sketch. Must be non-negative.

  • depth (int, optional) – Depth of the count-min sketch. Must be non-negative. Default is 8.

  • max_count (int, optional) – The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).

  • num_reserved (int, optional) – Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 15. This must be less than 255 (2**8 - 1).

  • shared_memory (bool, optional) – If True, then CountMinLinear is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False.

width

Width of the 2-d array of counters of the count-min sketch

Type

np.uint64

depth

Depth of the 2-d array of counters of the count-min sketch

Type

np.uint64

max_count

The maximum value we want to count up to for any given key. Default is 2**32 -1 (4,294,967,295).

Type

int, optional

num_reserved

Perform linear counting for values [0, num_reserved]. After that use log counters. This gives more precise estimates for the number of times a key is seen for counts <= num_reserved. Default is 15. This must be less than 255 (2**8 - 1).

Type

int, optional

shared_memory

Whether cms and n_added_records are attached to a shared memory block

Type

bool

cms

2-d array of the counters. Shape = (depth, width)

Type

np.uint32[:,:]

n_added_records

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed. Useful if you want to calculate a TF-IDF.

Type

np.uint64[:]

base

Calculated base for the log counters. Depends affected by max_count and num_reserved.

Type

np.float64

add(key: bytes, value: int = 1) → None[source]

Add a single key to the count-min sketch and update the counter tracking total number of keys added to the count-min sketch. This is in n_added_records[0].

Parameters
  • key (bytes) – Element to be added to the sketch

  • value (int, optional) – Number of times to add key to the sketch. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]

Take a given key and split it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters
  • key (bytes) – Element to be shingled before adding to the sketch

  • ngram (int) – ngram size

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]

Load a saved CountMinLog8 stored in filename

Parameters
  • filename (str | Path) – File path to the saved .npz file

  • shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

CountMinLog8

merge(other) → None[source]

Merge the count-min sketch other into this one.

Parameters

other (CountMinLog8) – Another CountMinLog8 with the same parameters.

Returns

Return type

None

Raises

TypeError – If other has different parameters

query(key: bytes) → float[source]

Return estimated number of times key was added into the count-min sketch

Parameters

key (bytes) – Element whose estimated count you want returned

Returns

Return type

float

sketchnu.countmin.load(filename: Union[str, pathlib.Path], shared_memory: bool = False) → Union[sketchnu.countmin.CountMinLinear, sketchnu.countmin.CountMinLog16, sketchnu.countmin.CountMinLog8][source]

Load a saved count-min sketch stored in filename

Parameters
  • filename (str | Path) – File path to the saved .npz file

  • shared_memory (bool) – If True, load into shared memory

Returns

Return type

CountMinLinear | CountMinLog16 | CountMinLog8

sketchnu.heavyhitters

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

Copyright (C) 2022 Matthew Hendrey

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Numba implementation of the topkapi algorithm from

A. Mandal, H. Jiang, A. Shrivastava, and V. Sarkar, “Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements”, Advances in Neural Information Processing Systems 31, (2018).

class sketchnu.heavyhitters.HeavyHitters(width: int, depth: int = 4, max_key_len: int = 16, phi: float = None, shared_memory: bool = False)[source]

Bases: object

Sketch implementation of the phi-heavy hitters algorithm which identifies all the keys in a data stream that are observed in at least phi fraction of the records. This assumes that keys have a fat-tailed distribution in the data stream. This is an implementation of the Topkapi algorithm.

The parameter phi must be greater than 1 / width of the sketch for the theoretical guarantees to be valid. The theoretical guarantees use a count-min sketch to estimate the frequency of any given key. For practical reasons, the paper suggests dropping the count-min sketch to save space. In this case you use the lhh_count as an estimate of the frequency of occurence for a given key. Note that lhh_count <= true count <= cms. So when we call the query function we are doing a more conservative estimate which corresponds to a higher phi. If you are feeling like you want to squeeze more out of the sketch you can provide a lower threshold (= phi * n_added()) when calling the query() function.

Parameters
  • width (int) – Width of the heavy hitters sketch. Must be non-negative

  • depth (int, optional) – Depth of the heavy hitters sketch. Must be non-negative. Default is 4

  • max_key_len (int, optional) – Maximum number of bytes any given key may have. Must be less than 256. Default is 16

  • phi (float, optional) – When generating the candidate set of heavy hitters, only keys whose estimated frequency of occurrence (lhh_count) >= phi * n_added() will be added to the candidate set. Default of None is set to 1 / width.

  • shared_memory (bool, optional) – If True, then sketch is placed in shared memory. Needed if performing multiprocessing as sketchnu.helpers.parallel_add() does. Default is False

width

Width of the 2-d array of counters of the sketch

Type

np.uint64

depth

Depth of the 2-d array of counters of the sketch

Type

np.uint64

max_key_len

Maximum number of bytes any given key may have. Must be less than 256

Type

np.uint64

phi

When generating the candidate set of heavy hitters, only keys whose estimated frequency of occurrence (lhh_count) >= phi * n_added() will be added to the candidate set. Default of None is set to 1 / width.

Type

np.float64

lhh

Storing the keys associated with each bucket in the 2-d array of counters. Keys are stored as numpy arrays, as opposed to 2-d list of bytes, in order for numba to be able to process them. If a given key has fewer bytes than max_len_key, then right padded with 0s.

Type

np.ndarray, shape=(depth, width, max_key_len), dtype=np.uint8

lhh_count

Store the counts associated with keys stored in lhh.

Type

np.ndarray, shape=(depth, width), dtype=np.uint32

key_lens

The length of each of the keys stored in lhh

Type

np.ndarray, shape=(depth, width), dtype=np.uint8

n_added_records

1-d array that holds two special counters. The first is the number of elements that have been added to the sketch. Useful for calculating error limits. The second is used by helpers.parallel_add() to keep track of the number of records that have been processed.

Type

np.ndarray, shape=(2,), dtype=np.uint64

add(key: bytes, value: int = 1) → None[source]

Add a single key to the heavy hitters sketch and update the counter tracking total number of keys added to the sketch.

Parameters
  • key (bytes) – Element to be added to the sketch

  • value (int, optional) – Number of times to add key to the sketch. Default is 1

Returns

Return type

None

add_ngram(key: bytes, ngram: int) → None[source]

Take a given key and shingle it into ngrams of size ngram and then add the ngrams to the sketch. If the key length is less than ngram then add the whole key

Parameters
  • key (bytes) – Element to be shingled before adding to the sketch

  • ngram (int) – ngram size

Returns

Return type

None

attach_existing_shm(existing_shm_name: str) → None[source]

Attach this sketch to an existing shared memory block. Useful when working within a spawned child process. This creates self.existing_shm which gets closed when self.__del__ gets called.

Parameters

existing_shm_name (str) – Name of an existing shared memory block to attach this sketch to

Returns

Return type

None

generate_candidate_set(threshold: int = None) → None[source]

Generate a candidate set of heavy hitters. Only keys in lhh whose corresponding counts in lhh_count are greater the threshold are included. Contrary to the paper, we all the rows instead of just the first one. This seems like a small price to pay to not lose candidates due to hash collisions. The candidate set is a collections.Counter stored in self.candidate_set

Parameters

threshold (int, optional) – If None (default), then uses threshold provided during instantiation

Returns

Return type

None

static load(filename: Union[str, pathlib.Path], shared_memory: bool = False)[source]

Load a saved HeavyHitters stored in filename

Parameters
  • filename (str | Path) – File path to the saved .npz file

  • shared_memory (bool, optional) – If True, load into shared memory. Default is False.

Returns

Return type

HeavyHitters

merge(other) → None[source]

Merge the HeavyHitter sketch other into this one.

Parameters

other (HeavyHitters) – Another HeavyHitters with the same width, depth, max_key_len.

Returns

Return type

None

Raises

TypeError – If other has different width, depth, or max_key_len

n_added() → numpy.uint64[source]

This special counter is used to track the total number of elements that have been added to the sketch. Useful to check the error guarantees.

Returns

The number of elements that have been added to the sketch.

Return type

np.uint64

n_records() → numpy.uint64[source]

This special counter is used by the sketchnu.helpers.parallel_add() to keep track of the number of records that have been added to the sketch.

Returns

The number of records that have been added to the sketch.

Return type

np.uint64

query(k: int, threshold: int = None) → List[Tuple[bytes, int]][source]

Return the top k heavy hitters. If new data has been added or if threshold is different from the last time a candidate set was generated, then this will generate a new candidate set before selecting the top k.

Parameters
  • k (int) –

  • threshold (int, optional) – Only include keys from lhh whose lhh_counts >= threshold. Default is None which then sets threshold to self.phi * self.n_added()

Returns

Sorted list of the (key, count) of the top k. Format is the same as collections.Counter().most_common().

Return type

List[Tuple[bytes, int]]

save(filename: Union[str, pathlib.Path]) → None[source]

Save the sketch to filename adding the .npz extension if not already part of filename

Parameters

filename (str | Path) – File to save the sketch to disk. This will be a .npz file.

Returns

Return type

None

update(keys: Union[List[bytes], Dict[bytes, int]]) → None[source]

Add keys to the sketch. This follows the convention of collections.Counter

Parameters

keys (List[bytes] | Dict[bytes, int]) – List of elements to add to the sketch or a dictionary which can specify the number of times to add each key

Returns

Return type

None

update_ngram(keys: List[bytes], ngram: int) → None[source]

Given a list of keys, shingle each into ngrams of size ngram, and then add them to the sketch.

Parameters
  • keys (List[bytes]) – List of elements to be shingled before adding to the sketch.

  • ngram (int) – ngram size

Returns

Return type

None

sketchnu.helpers

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

Copyright (C) 2022 Matthew Hendrey

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Helper functions to aid in parallelizating the creation of sketches using Python’s multiprocessing.

sketchnu.helpers.attach_shared_memory(sketch_type: str, sketch_args: Dict, shm_name: str)[source]

Create a new sketch and attach it to the shared memory block. Sketch is created by passing sketch_args to either CountMin(), HeavyHitters() or HyperLogLog().

Parameters
  • sketch_type (str) – Specify the type of sketch to create. Must be “cms” | “hh” | “hll”.

  • sketch_args (Dict) – The associated arguments to instantiante a new sketch.

  • shm_name (str) – Name of a shared memory block that already exist. The new sketch will attach to this block.

Returns

local_sketch – A local sketch, relative to this process, that is attached to the shared memory block of an existing sketch that has shared_memory = True

Return type

CountMinLinear | CountMinLog16 | CountMinLog8 | HeavyHitters | HyperLogLog

sketchnu.helpers.parallel_add(items: Iterable, process_q_item: Callable[[...], int], n_workers: int = None, cms_args: Dict = None, hh_args: Dict = None, hll_args: Dict = None, **kwargs)[source]

Places items onto a queue to be processed by n_workers independent spawned processes.

The user defined function, process_q_item, takes the arguments (q_item, *sketches, **kwargs). This function is given a single item from the queue and then should add elements to the sketch(es) as desired. The function should return the number of records processed. If process_q_item will add elements to multiple sketches, then they must be listed in alphabetic order since that is how parallel_add will pass them to process_q_item.

The **kwargs are passed along to process_q_item to allow for any needed additional parameters.

Once all items have been processed, the n_workers sketch(s) are merged with the final result(s) returned.

You must provide at least cms_args | hh_args | hll_args. If you provide more than one, then the requested sketches will be processed at the same time while going over the data just once.

Note: If your data has duplicate keys within a item, you will likely see better performance if process_q_item aggregates the counts per key and then calls the sketch.add(key, count)

Parameters
  • items (Iterable) – A generator or list of items that will be placed onto a queue and then worked by one of the workers in a separate spawned process.

  • process_q_item (Callable[.., int]) – User defined function whose arguments are (q_item, *sketches, **kwargs) that takes the q_item, adds elements to the sketch(es) and returns the number of records that were processed. *sketches must be in alphabetic order since that is how they will be passed by parallel_add.

  • n_workers (int, optional) – Number of workers to use. Each will update their own sketches which will then get merged together to achieve the final sketch(s). If None (default), then set to psutil.cpu_count(logical=False)

  • cms_args (Dict, optional) – Dictionary containing arguments to instantiate a CountMin. If None (default) then don’t create a sketch of this type.

  • hh_args (Dict, optional) – Dictionary containing arguments to instantiate a HeavyHitters. If None (default) then don’t create a sketch of this type.

  • hll_args (Dict, optional) – Dictionary containing arguments to instantiate a HyperLogLog. If None (default) then don’t create a sketch of this type.

  • **kwargs – Keyword arguments that get passed to process_q_item function

Returns

The final sketch(s). If doing more than one sketch, then they are returned as a tuple in alphabetical order: cms, hh, hll

Return type

sketches

sketchnu.helpers.parallel_merging(sketch_array: List, log_queue: multiprocessing.context.BaseContext.Queue)[source]

Merge an array of sketches in successive rounds of pairing. This will use at most len(sketch_array) // 2 processes. After merging, the final sketch is returned.

Parameters
  • sketch_array (List) – Array containing sketches to be merged together

  • log_queue (Queue) – Log statements to a log Queue

Returns

The final sketch resulting from merging all of the sketches in the sketch_array.

Return type

CountMinLinear | CountMinLog16 | CountMinLog8 | HyperLogLog

sketchnu.hashes

Sketchnu has Numba implementations of sketch algorithms and other useful functions that utilize hash functions.

Copyright (C) 2022 Matthew Hendrey

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Numba implementations of the non-cryptographic hashing functions FastHash (32 & 64-bit) and MurmurHash3 (32-bit). The 64-bit FastHash is used in both the HyperLogLog and count-min sketch code.

sketchnu.hashes.fasthash32[source]

Calculate the unsigned 32-bit integer FastHash value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/

Parameters
  • key (bytes) – Bytes to be hashed

  • seed (uint64) – Seed to use when hashing

Returns

Unsigned 32-bit integer hash value of key

Return type

uint32

sketchnu.hashes.fasthash64[source]

Calculate the unsigned 64-bit integer FastHash value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/

Parameters
  • key (bytes) – Bytes to be hashed

  • seed (uint64) – Seed to use when hashing

Returns

Unsigned 64-bit integer hash value of key

Return type

uint64

sketchnu.hashes.murmur3[source]

Calculate the unsigned 32-bit integer MurmurHash3 value of key with the given seed. Code adapted from https://github.com/rurban/smhasher/

Parameters
  • key (bytes) – Bytes to be hashed

  • seed (uint32) – Seed to use when hashing

Returns

Unsigned 32-bit integer hash value of key

Return type

uint32

sketchnu.tests