Pure and impure Cuckoo Filter.
Please see the README on Github at https://github.com/ChrisCoffey/cuckoo-filter#readme
cuckoo-filter
Cuckoo filters are a probabilistic data structure used to answer questions like "Have I already seen this user" or "Is this word in the English language?". They're probabilistic because each membership operation has a false positive probability. It guarnatees that there will never be a false negative, but may have a low chance of false positives.
Bloom filters are the cannonical probabilistic filter structure, and cuckoo filters are a simlar but different tool. As a bloom filter's load factor increases, the chance of false positive trends towards 100%, but the inserts will never fail. On the other hand, a Cuckoo filter retains a relatively stable false positive probability under load, but as load approahes 95% inserts will begin to fail. In either case you probably want to resize your filter...
This implementation has the following properties:
- Buckets of 4 elements
- 8 bit fingerprints
- Cycle termination during item kicking occurs after (0.1 * size) buckets have been checked.
- Size may be any non-zero natural number (not limited to powers of 2)
For more details about how Cuckoo filters work, I recommend you read Fan et. al.'s 2016 paper https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf.
Usage
Cuckoo filters support three operations: insert
, member
, and delete
. See the haddocks for details.
Performance
As you'll find in the criterion results, the pure version of the filter can handle ~1.6 million insertions/s. From memory profiles, the vast majority of the memory is taken up by the underlying implementation of Filter
, so this is an obvious area for improvement.
The current implementation avoids pre-allocating memory for the filter, so the heap usage will incrase linearly with insert
calls. This obviously helps keep heap usage low for sparse filters, but also means inserts are slower than they would be in a mutable implementation.
Loading a SpellChecker test
The following test was run on a laptop, so the absolute numbers are going to vary a ton. The important thing is the relationship between the pure & immutable filter implementations.
The test consists of:
- Load the
/usr/share/dict/words
file into memory - Create a filter containing all of the words
- Lookup each word in the filter
Pure
500000 cells
235886 words
0.078749ss to count words
0.933969ss to construct filter
745 insert failures
0.80465ss to query every element
Mutable
500000 cells
235886 words
0.082926ss to count words
0.29735ss to construct filter
582 insert failures
0.52605ss to query every element
Incredibly unscientific comparison to bloom-filter
using a vanilla filter
235886 words
0.087499ss to count words
Bloom { 4194304 bits }
0.464982ss to construct filter
0.506902ss to query every element
*** Cuckoo Filters report the number of failures, while the Bloom Filter reports how many bits it contains. I'll start capturing size for the mutable Cuckoo Filter soon.