Skip to content
This repository has been archived by the owner on Jan 21, 2019. It is now read-only.

Latest commit

 

History

History
36 lines (24 loc) · 2.04 KB

File metadata and controls

36 lines (24 loc) · 2.04 KB

Bloom Filter

Wikipedia describes a Bloom filter as a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It uses multiple hash functions to determine membership in a set.

When an element is added to a filter, a subsequent membership test will definitely return True. However as more elements that are added to the set, the probability of false positives will increase. Elements are always added to the set, never removed. No Delete operation

Implementations

A nice interactive tutorial on the bloom filter Java implementation Another Java implementation but with less functionality Ruby implementation

Usage

  1. Performance when accessing data. Good to check if data exists before accesing physical disk, so there is less disk I/O
  2. Safe browsing chromium

How to implement one in Ruby

n = Number of items in the filter p = Probability of false positives, float between 0 and 1 or a number indicating 1-in-p m = Number of bits in the filter k = Number of hash functions

We need a hash function or multiple hash functions. We can use cryptographic hash md5 or sha1 from Ruby. I did not use Ruby's built-in hash because it already seeds a value and generates different hashes for each process, not that it would matter, since this experimental, but just in-case. We need a bitarray to set the values generated by the hash. The bit indices are calculated by generating the hash for ( string + i) count times. The hash is probably not a random distribution but still manages to set k bits.

The fp_rate is $$(1 - (1 - \frac{1}{f})^{n})^{k}$$. I simply calculate and don't use it much.