Why Bloom filters work

So far, we have asked you to take for granted that Bloom filters do work as we described. Now it’s time to look more closely and explain why a Bloom filter actually works. Although this section is not strictly needed to implement or use Bloom filters, reading it might help you understand this data structure in more depth.

As already mentioned, Bloom filters are a tradeoff between memory and accuracy. If you create an instance of a Bloom filter with a storage capacity of 8 bits and then try to store 1 million objects in it, chances are that you won’t get great performance. In fact, with an 8-bit buffer, the whole buffer would be set to 1 after approximately 10-20 hashes. At that point, all calls to contains will just return true, and you will no longer be able to tell whether or not an object was actually stored in the container. Figure 4.6 shows an example of such a situation.
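To make the saturation problem concrete, here is a minimal Python sketch; the two hash functions are simulated by slicing a SHA-256 digest, an arbitrary choice made only for illustration, not the implementation described in this chapter. With an 8-bit buffer and 2 indices per key, a handful of insertions is enough to set every bit, after which every lookup answers true.

import hashlib

# Toy filter: 8-bit buffer, 2 indices per key (hypothetical parameters).
M, K = 8, 2

def indices(key, m=M, k=K):
    # Simulate k hash functions by taking k bytes of a single digest.
    digest = hashlib.sha256(str(key).encode()).digest()
    return [digest[i] % m for i in range(k)]

buffer = [0] * M

for key in range(20):                # try to store 20 keys in an 8-bit filter
    for i in indices(key):
        buffer[i] = 1

print(buffer)                        # very likely [1, 1, 1, 1, 1, 1, 1, 1]
# Once every bit is 1, any key at all appears to be "contained":
print(all(buffer[i] == 1 for i in indices("never stored")))   # True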

If, instead, we allocate sufficient space and choose our hash functions well, then the indices generated for each key won’t clash, and for any two different keys the overlap between the lists of indices generated will be minimal, if any.

But how much space is sufficient? Internally a Bloom filter translates each key into a sequence of k indices chosen out of m possible alternatives:[1] the trick is that we store keys efficiently by flipping these k bits in the filter’s buffer, a bit-array.
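As a rough sketch of this idea (in Python, simulating the k hash functions by salting a single SHA-256 digest with the function’s index; the helper names and the hashing scheme are assumptions of this sketch, not the book’s implementation), storing a key boils down to computing its k indices and flipping the corresponding bits:

import hashlib

def key_to_indices(key, k, m):
    # Map a key to k indices in the range 0..m-1.
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out

def insert(buffer, key, k):
    # Storing a key means flipping (at most) k bits to 1; nothing else is kept.
    for index in key_to_indices(key, k, len(buffer)):
        buffer[index] = 1

buffer = [0] * 32
insert(buffer, "apple", k=3)
print(key_to_indices("apple", 3, 32))   # the three indices flipped for "apple"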

If you have brushed up on your school algebra recently, you might have guessed that we can only represent m^k different sequences of k elements drawn from m values; we need, however, to take two considerations into account:

  1. We actually can’t even use all these sequences. We would like all indices associated with a key to be different (otherwise, we would actually store fewer than k bits per key), so we strive for all lists of k indices to be duplicate-free.
  2. We are not really interested in the order of these k indices. It’s completely irrelevant if we first write the bit at index 0 and then the bit at index 3, or vice versa. We can thus consider sets instead of sequences.

All things considered, we only (at least in theory) allow a fraction of all those possibilities: the sets of exactly k (distinct) indices drawn from the range 0..m-1. The number of all these possible (valid) sets is given by

C(m, k) = m! / (k! * (m - k)!)

The binomial coefficient in this formula expresses the number of ways we can extract k (unique) elements from a set of size m, and hence it tells us how many sets with exactly k distinct indices our k hash functions can return.

If we want each key to map to a different set of k indices, then given k and n, the number of keys to store, we can compute a lower bound for m, the size of the buffer, by using the formula above.
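For instance, a short Python sketch can search for this lower bound directly, using math.comb for the binomial coefficient. Note that this is only the counting argument above; it is not the sizing formula derived in section 4.10, which also accounts for the collisions that cause false positives in practice.

from math import comb

def min_buffer_size(n, k):
    # Smallest m such that there are at least n distinct sets of k indices,
    # that is, the smallest m with comb(m, k) >= n.
    m = k
    while comb(m, k) < n:
        m += 1
    return m

print(min_buffer_size(1_000_000, 4))    # 72: the purely combinatorial lower bound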

Another way to look at the question is that given a sequence of m bits (the buffer), we can only represent 2^m different values, so at any given time the Bloom filter can only be in one of 2^m possible states; however, this only gives us a loose (although easier to compute) bound on n, because it does not take into account that we store k bits per key (2^m becomes an exact bound only for k==1).

We’ll see in section 4.10 how to choose the number of hash functions and the size of the array in order to optimize the ratio of false positives for a Bloom filter capable of storing a given number of keys.

1. Why there are no false negatives …

In the simplest version of Bloom filters, deletion is not allowed. This means that if a bit is flipped when a key is stored, it will never ever be set to 0 again.

At the same time, the output of the hash functions is deterministic and does not change over time.

TIP Remember to abide by these properties in your implementation if you need to serialize Bloom filters and deserialize them later.
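As a sketch of what such a serialization might include (the format below is an assumption made for illustration, not a prescribed one): the bit values alone are not enough, because whatever parameters or seeds the hash functions depend on must travel with them, so that the deserialized filter maps every key to exactly the same indices.

import json

def serialize(buffer, k, seed):
    # Persist everything needed to reproduce the same hash outputs later:
    # the bits *and* the parameters the hash functions depend on.
    return json.dumps({"bits": buffer, "k": k, "seed": seed})

def deserialize(payload):
    data = json.loads(payload)
    return data["bits"], data["k"], data["seed"]

payload = serialize([1, 0, 1, 0], k=2, seed=42)
bits, k, seed = deserialize(payload)    # ready to answer lookups consistently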

So, if a lookup finds out that one of the bits associated with a key X is set to 0, then we know for sure that X was never added to the filter; otherwise, all the bits to which X is hashed would be 1.
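In code, the lookup mirrors this reasoning exactly. Here is a minimal sketch, again simulating the k hash functions with a salted SHA-256 digest (an assumption of this sketch):

import hashlib

def indices(key, k, m):
    # Derive k indices in 0..m-1 from a salted digest.
    return [int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
            for i in range(k)]

def contains(buffer, key, k):
    # A single 0 among the key's bits proves the key was never inserted;
    # finding all 1s only means "possibly present".
    return all(buffer[i] == 1 for i in indices(key, k, len(buffer)))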

2. … But there are false positives

The reverse, though, doesn’t hold up, unfortunately! An example will help you understand why. Suppose we have a simple Bloom filter with 4 bits and 2 hash functions. Initially our buffer is empty:

B = [0, 0, 0, 0]

First, we insert the value 1 into the Bloom filter. Assume we had chosen our hash functions such that h0(1) = 0 and h1(1) = 2, so the key 1 maps to indices {0,2}. After updating it, our buffer now looks like this:

B = [1, 0, 1, 0]

Now we insert the value 2, and it turns out that h0(2) = 1 and h1(2) = 2. Therefore, we now have:

B = [1, 1, 1, 0]

Finally, suppose that h0(3) = 1 and h1(3) = 0. If we perform a lookup for the value 3 after those two insertions, both bits at indices 1 and 0 will have been set to 1, even though we never stored 3 in our instance of the filter! Therefore, 3 would give us a false positive.

On the other hand, if the hash functions had produced a different mapping, for instance, h0(3) = 3 and h1(3) = 0, then since the fourth bit hadn’t been set yet (B[3]==0), a lookup would have returned false.
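The whole example can be replayed in a few lines of Python. The hash values below are simply hardcoded to the ones assumed above, since the point is the overlap on indices 0 and 1, not the hash functions themselves:

# Hardcoded stand-ins for h0 and h1, matching the values assumed in the text.
h = {1: (0, 2), 2: (1, 2), 3: (1, 0)}

B = [0, 0, 0, 0]
for key in (1, 2):                        # insert only 1 and 2
    for index in h[key]:
        B[index] = 1

print(B)                                  # [1, 1, 1, 0]
print(all(B[i] == 1 for i in h[3]))       # True: a false positive for 3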

Of course, this is a simplistic example intentionally crafted to prove a point: false positives are possible, and they are more likely to happen if we don’t choose the parameters of our Bloom filter carefully. In section 4.10, we’ll learn how to tune these parameters, based on the number of elements we anticipate we will store and on the precision we need.

3. Bloom filters as randomized algorithms

Appendix F introduces the taxonomy of randomized algorithms, and in particular the dichotomy between Las Vegas and Monte Carlo algorithms.

If you don’t have a clear idea of the difference between these two classes or what randomized algorithms are, this would be a good time to check appendix F.

Once those definitions are clear to you, it should not be hard to make an educated guess: Which category do our Bloom filters belong to?

As you might have already figured out, Bloom filters are an example of a Monte Carlo data structure. In particular, contains, the method that checks whether a key is stored in a Bloom filter, is a false-biased algorithm. It might return true for some keys that were never added to the filter, but it always correctly returns true when a key was previously added, so there are no false negatives (that is, every time it answers false, we are sure the answer is correct).
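A quick experiment makes this false-biased behavior visible. The sketch below again simulates the hash functions with salted digests, and the parameters (m = 1024 bits, k = 3 hash functions, 200 stored keys) are arbitrary choices for the demonstration:

import hashlib

def indices(key, k, m):
    # k hash functions simulated with a salted digest (an assumption of this sketch).
    return [int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
            for i in range(k)]

m, k = 1024, 3
bits = [0] * m

stored = [f"key-{i}" for i in range(200)]
for key in stored:
    for i in indices(key, k, m):
        bits[i] = 1

def contains(key):
    return all(bits[i] == 1 for i in indices(key, k, m))

# No false negatives: every key that was actually stored is reported as present...
assert all(contains(key) for key in stored)

# ...but some keys that were never stored may be reported as present too.
never_stored = [f"other-{i}" for i in range(10000)]
false_positives = sum(contains(key) for key in never_stored)
print(f"observed false positive rate: {false_positives / len(never_stored):.3f}")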

NOTE Bloom filters are also an example of the tradeoff between memory and accuracy: the deterministic counterpart of a Bloom filter is the hash set, which answers membership queries exactly but requires more memory.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
