Under the hood: How do Bloom filters work?

Let’s now delve into the details of Bloom filter implementation. A Bloom filter is made of two elements:

  • An array of m elements
  • A set of k hash functions

The array is (conceptually) an array of bits, each of which is initially set to 0; each hash function outputs an index between 0 and m-1.

It is crucial to clarify as soon as possible that there is not a 1-to-1 correspondence between the elements of the array and the keys we add to the Bloom filter. Rather, we will use k bits (and so k array’s elements) to store each entry for the Bloom filter; k here is typically much smaller than m.

Note that k is a constant that we pick when we create the data structure, so each and every entry we add is stored using the same amount of memory, exactly k bits. With string values, this is pretty amazing, because it means we can add arbitrarily long strings to our filter using a constant amount of memory, just k bits.

When we insert a new element key into the filter, we compute k indices for the array, given by the values h0 (key) up to h(k-1) (key) , and set those bits to 1.

When we look up an entry, we also need to compute the k hashes for it as described for insert, but this time we check the k bits at the indices returned by the hash functions and return true if and only if all bits at those positions are set to 1. Figure 4.3 shows both operations in action.

Ideally, we would need k different independent hash functions, so that no two indices are duplicated for the same value. It is not easy to design a large number of indepen­dent hash functions, but we can get good approximations. There are a few solutions commonly used:

  • Use a parametric hash function H(i). This meta-function, which is a generator of hash functions, takes in as input an initial value i and outputs a hash func­tion Hi=H(i) . During the Bloom filter’s initialization, we can create k of this functions, H0 to Hk-1, by calling the generator H on k different (and usually ran­dom) values.
  • Use a single hash H function but initialize a list L of k random (and unique) values. For each entry key that is inserted/searched, create k values by adding or appending L[i] to key, and then hash them using H. (Remember that well-designed hash functions will produce very different results for small changes in the input.)
  • Use double or triple hashing.

While the latter won’t guarantee independence among the hash functions generated, it has been proven that we can relax this constraint with a minimal increase in the false positive rate. To keep things simple, in our implementation we use double hash­ing with two independent hash functions: Murmur hashing and Fowler-Noll-Vo (fnvl) hashing.

The general format of our i-th hash function, for i between 0 and k-1, will be

hi(key) = murmurhash(key) + i * fnvl(key) + i * i

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)

Leave a Reply

Your email address will not be published. Required fields are marked *