Kevin Kreutz:
The Count-min sketch with and without conservative update
Kurzbeschreibung
The Count-min sketch is a data structure, which uses the probabilistic nature of a hash table to approximate the frequencies of elements in a data stream.
Compared to an exact counter that does not use hash functions, the Count-min sketch has advantages in its update time and space needed. The disadvantage of the Count-min sketch is that the count is not exact and has an error rate that depends on two parameters: the number of hash functions used and the size of their codomains. The theoretical analysis of Cormode and Muthukrishnan is restated in this thesis. It provides a mathematically proven method on how to pick these parameters for a wanted error range. This thesis also explains how the Count-min sketch and its conservative update variant work in theory.
The aim of the conducted experiments is to determine how the parameters should be picked. For this, the results of the theoretical analysis are used in a python3 implementation of both variants.
We show that the Count-min sketch works well at approximating the most frequent elements in language data, by conducting experiments on the complete works of Shakespeare. Then we show that on the contrary this is not the case with data, where the distribution of elements is uniform. It is also shown that the conservative update is more accurate in both data sets. For this reason, it is advised to use this variant in practice if possible.