Wikipedia describes a Bloom filter as a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It uses multiple hash functions to determine membership in a set.
When an element is added to a filter, a subsequent membership test will definitely return True. However as more elements that are added to the set, the probability of false positives will increase. Elements are always added to the set, never removed. No Delete operation
A nice interactive tutorial on the bloom filter Java implementation Another Java implementation but with less functionality Ruby implementation
- Performance when accessing data. Good to check if data exists before accesing physical disk, so there is less disk I/O
- Safe browsing chromium
n = Number of items in the filter p = Probability of false positives, float between 0 and 1 or a number indicating 1-in-p m = Number of bits in the filter k = Number of hash functions
We need a hash function or multiple hash functions. We can use cryptographic hash md5 or sha1 from Ruby. I did not use Ruby's built-in hash because it already seeds a value and generates different hashes for each process, not that it would matter, since this experimental, but just in-case. We need a bitarray to set the values generated by the hash. The bit indices are calculated by generating the hash for ( string + i) count times. The hash is probably not a random distribution but still manages to set k bits.
The fp_rate is