Bloom filters: Performance analysis and estimating precision

1. Performance analysis

Before starting on Bloom filter analysis, I suggest a deep dive into metrics for classification algorithms in appendix F.

We have seen how and why Bloom filters work; now let’s see how efficient they are. First, we’ll examine the running time for the most important operations provided by a Bloom filter. Next, in section 2, we will tackle the challenge of predicting the precision of method contains, given a Bloom filter with a certain structure (in particular, its size and the number of hash functions used will matter).

1.1. Running time

We have already hinted at the fact that Bloom filters can store and look up a key in constant time. Technically, this is only true for constant-length inputs; here we will examine the most generic case, when the keys stored are strings of arbitrary length.

But let’s start from the beginning: the very construction of a Bloom filter. Afterwards, we’ll examine insert and contains in detail.

1.2. Constructor

Construction of a Bloom filter is pretty easy; we just have to initialize an array of bits, with all its elements set to 0, and generate the set of k hash functions. We have seen that the implementation also involves some computation, but it’s safe to mark that part as constant time.

Creating and initializing the array requires, obviously, O(m) time, while generating each of the hash functions typically requires constant time; hence O(k) time overall is needed to generate the whole set.

The whole construction can ultimately be finished in O(m + k) steps.
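As an illustration, here is a minimal Python sketch of such a constructor. The function and variable names are our own, and we simulate the k hash functions by salting a single hash function with k different seeds, a common way to approximate an independent family:

```python
def make_bloom_filter(m, k):
    """Initialize a Bloom filter: O(m) for the buffer, O(k) for the hash functions."""
    bits = [0] * m          # O(m): allocate the buffer with every bit set to 0
    seeds = list(range(k))  # O(k): one salt per simulated hash function
    return bits, seeds
```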

1.3. Storing an element

For each key to store, we need to produce k hash values and set the bit at each of the array positions indexed by those values.

We’ll make the following assumptions:

  • Storing a single bit requires constant time (possibly including the time needed for bitwise operations in case we use compressed buffers to save space).
  • Hashing a key X requires time T(|X|), a function of the key’s size |X|.
  • The number of bits used to store a key doesn’t depend on the key’s size, or on the number of elements already added to the container.

Given these assumptions, the running time for insert(X) is O(k * T(|X|)). For each of the k hash functions we need, in fact, to generate a hash value from the key and store a single bit. If keys are numbers (for instance, integers or doubles), then |X| = 1 and T(|X|) = O(1), meaning that we can typically generate a hash value in constant time.

If, however, our keys are of variable length (for example, strings), then computing each hash value requires time linear in the length of the string. In this case, T(|X|) = O(|X|), and the running time will depend on the length of the keys we add.

Now, suppose we know that the longest key will have at most z characters, where z is a constant. Remembering our assumption about the length of keys being independent of anything else, we can then still argue that the running time for insert(X) is O(k * (1 + z)) = O(k). Thus, this is a constant-time operation, no matter how many elements have already been added to the container.
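To make the cost concrete, here is a hedged Python sketch of insert that builds on the constructor above. The helper bit_indices is our own invention: it salts SHA-256 with each seed, so each of the k index computations scans the whole key once, which is exactly where the T(|X|) term comes from:

```python
import hashlib

def bit_indices(key, seeds, m):
    """Map a key to one index per seed; each hash scans the whole key, costing O(|key|)."""
    for seed in seeds:
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        yield int.from_bytes(digest, "big") % m

def insert(bits, seeds, key):
    """O(k * |key|): k hash computations, each followed by a constant-time bit write."""
    for i in bit_indices(key, seeds, len(bits)):
        bits[i] = 1
```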

1.4. Looking up an element

The same considerations hold true for the lookup of a key: we need to transform the key into a set of indices (done in O(z * k) time), and then check each bit at those indices (O(k) time for all of them). So lookup is also a constant-time operation, under the assumption that keys’ lengths are bounded by a constant.
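The lookup is symmetric; this sketch reuses the bit_indices helper and the constructor from the examples above:

```python
def contains(bits, seeds, key):
    """Same O(k * |key|) cost as insert: recompute the k indices, then check k bits."""
    return all(bits[i] == 1 for i in bit_indices(key, seeds, len(bits)))

# Example usage, combining the three sketches:
bits, seeds = make_bloom_filter(m=1024, k=3)
insert(bits, seeds, "apple")
print(contains(bits, seeds, "apple"))   # True
print(contains(bits, seeds, "banana"))  # False (barring an unlucky false positive)
```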

2. Estimating Bloom filter precision

Before we start, we need to fix some notation and make a few more assumptions:

  • m is the number of bits in our array.
  • k is the number of hash functions we use to map a key to k different positions in the array.
  • Each of the k hash functions we use is independent of the others.
  • The pool of hash functions from which we extract our k functions is a universal hashing set.

If these assumptions hold, then it can be proven that a good approximation for the false-positive probability after n insertions is given by

p(n, m, k) \approx \left(1 - e^{-kn/m}\right)^k

where e is Euler’s number, the base of natural logarithms.

Now we have a formula to estimate the probability of a false positive! That’s nice per se, but what’s even better is that it allows us to tune m and k, our Bloom filter parameters. This way we can decide the size of the Bloom filter’s buffer and the number of hash functions needed to have optimal precision.
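The formula translates directly into code. Here is a small Python sketch (the function name is our own choice):

```python
import math

def false_positive_rate(n, m, k):
    """Approximate false-positive probability p(n, m, k) = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# For example, n=1000 keys in m=9600 bits with k=7 hash functions
# gives roughly 0.01, that is, about 1% false positives.
print(false_positive_rate(1000, 9600, 7))
```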

There are three variables in the formula for p(n, m, k):

  • m, the number of bits in the buffer
  • n, the number of elements that will be stored in the container
  • k, the number of hash functions

Of those three, k is the one that seems less meaningful to us. Or, from another point of view, it is the one least coupled with our problem. In fact, n is probably going to be a variable that we can estimate but can’t fully control. We might need to store as many elements as we get requests for, but most of the time we can anticipate the volume of requests and make our estimates pessimistic, to be on the safe side.

We might be limited in the choice of m as well, since memory constraints could cap the number of bits we can afford for the buffer.

For k, instead, we have no external constraint, and we can tune it to get optimal precision, meaning, as we explained in appendix F, the minimal probability of false positives.

Luckily, finding the optimal value for k, given m and n, isn’t even that hard: it’s just a matter of finding the minimum of the function

f(k) = \left(1 - e^{-kn/m}\right)^k

Note that n and m are constants in this formula.

We will go into the details of finding the minimum for f(k) in the next section; for now (or if you are only interested in the result), just know that the optimal value is

k^* = \ln(2) \cdot \frac{m}{n}
Now that we have a formula for k*, we can substitute it into our previous formula, the one for p(n, m, k), and after some algebraic manipulation we obtain an expression for the optimal value of m (let’s call it m*):

m^* = -\frac{n \cdot \ln(p)}{(\ln 2)^2} = \frac{n \cdot \log_2(1/p)}{\ln 2}
This means that if we know in advance the total number n of unique elements that we will insert in the container, and we set the probability p of a false positive to the largest value that is acceptable, then we can compute the optimal size of the buffer (and the number of bits per key we need to use) in order to guarantee the desired precision.
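Here is a small Python sketch of this sizing step (the function name and rounding choices are ours, applying the formulas for m* and k* given above):

```python
import math

def size_bloom_filter(n, p):
    """Given the expected number of insertions n and the target false-positive
    probability p, return the optimal buffer size m* and hash-function count k*."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m* = -n*ln(p) / (ln 2)^2
    k = round(math.log(2) * m / n)                        # k* = ln(2) * m / n
    return m, k

# For example, a million keys with a 1% false-positive target needs about
# 9.6 million bits (roughly 1.2 MB) and k = 7 hash functions.
print(size_bloom_filter(1_000_000, 0.01))  # -> (9585059, 7)
```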

Two aspects are significant when looking at the formulas we derived:

  • The size of the Bloom filter’s buffer is proportional to the number of elements being inserted.
  • At the same time, the required number of hash functions only depends on the target false-positive probability p (you can see this by substituting m* into the formula for k*, but don’t worry, we’ll discuss the math in the next section).

2.1. Explanation of the false-positive ratio formula

In this section we’ll explain in more detail how the formulas to estimate a Bloom filter’s precision are derived; first, let’s see how we obtain the estimate for the false-positive probability.

After a single bit has been stored in a Bloom filter with a capacity of m bits, the probability that a specific bit is set to 1 is 1/m, and so the probability that it is left at 0 is 1 - 1/m. The probability that a specific bit is still 0 after all k bits used to store an element have been flipped (assuming the hash functions will always output k different values for the same input) is therefore

\left(1 - \frac{1}{m}\right)^k

If we consider the events of flipping any specific bit to 1 as independent events, then after inserting n elements, for each individual bit in the buffer the probability that the bit is still 0 is given by

\left(1 - \frac{1}{m}\right)^{kn}

To have a false positive, all the k bits corresponding to an element V must have been set to 1 independently, and the probability that all of those k bits are 1 is given by

p(n, m, k) = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k

This is, lo and behold, the probability formula we gave at the beginning of this section; the final approximation holds because (1 - 1/m)^m tends to 1/e as m grows.
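You can check numerically how tight the exponential approximation is; here is a quick Python sketch with arbitrary sample values:

```python
import math

m, n, k = 1024, 100, 5
exact  = (1 - (1 - 1 / m) ** (k * n)) ** k  # (1 - (1 - 1/m)^(kn))^k
approx = (1 - math.exp(-k * n / m)) ** k    # (1 - e^(-kn/m))^k
print(exact, approx)  # both are roughly 0.0086; they agree to several decimals
```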

At this stage, as we mentioned in the last section, we can consider n and m as constants. It makes sense because in many cases we know how many elements we need to add to the Bloom filter (n) and how many bits of storage we can afford (m); what we would like to do is trade performance for accuracy by tuning k, the number of (universal) hash functions we use.

This is equivalent to finding the global minimum of the function f, defined as

f(k) = \left(1 - e^{-kn/m}\right)^k
If you know some calculus, you probably have already guessed that we need to compute the derivatives of f with respect to k. (If you don’t know calculus, don’t worry; you can skip the next few lines and resume on the next page, where we will be using the result of this computation.)

To make our job easier, we can rewrite f by applying natural logarithm and exponentiation, so that we get

f(k) = e^{k \cdot \ln\left(1 - e^{-kn/m}\right)}
This function is minimal when its exponent is minimal, so we can define the function g

g(k) = k \cdot \ln\left(1 - e^{-kn/m}\right)
and compute the derivatives of g instead, which is easier to work with.

The first-order derivative of g(k) is

g'(k) = \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}

and it becomes equal to 0 for k = \ln(2) \cdot \frac{m}{n}.

To make sure this is a minimum for the function, we still need to compute the second-order derivative and check that it returns a positive value when computed on the zero of g’:

g''(k) = \frac{2n}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}} - k \cdot \frac{n^2}{m^2} \cdot \frac{e^{-kn/m}}{\left(1 - e^{-kn/m}\right)^2}
We’ll skip this step for the sake of space, but you can double-check that it’s indeed true.
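For readers who want to see it, here is a sketch of that check in LaTeX-style notation, writing c = n/m and u = e^{-ck} for brevity:

```latex
% With c = n/m and u = e^{-ck}, the second derivative above reads
% g''(k) = 2cu/(1-u) - k c^2 u / (1-u)^2.
% At k^* = \ln(2) \cdot m/n we get u = e^{-\ln 2} = 1/2, so that
% u/(1-u) = 1 and u/(1-u)^2 = 2, and therefore
g''(k^*) = 2\,\frac{n}{m} - 2\,\frac{n}{m}\ln 2 = 2\,\frac{n}{m}\,(1 - \ln 2) > 0
% Since 1 - \ln 2 \approx 0.307 > 0, k^* is indeed a minimum of g.
```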

It’s worth noting that

  • The formula for k gives us a single, exact value for the optimal choice of the number of hash functions.
  • k must obviously be an integer, so the result needs to be rounded.
  • Larger values of k mean worse performance for insert and lookup (because more hash functions need to be computed for each element), so a compromise with a slightly smaller value of k can be preferred.

If we use the best value for k as computed previously, the rate of false positives f becomes

f = \left(1 - e^{-\ln 2}\right)^{k^*} = \left(\frac{1}{2}\right)^{\ln(2) \cdot m/n} \approx 0.6185^{\,m/n}
By replacing this value of k in the formula for p(n, m, k), we can get a new formula that links the number of storage bits to the (maximum) number of elements that can be stored, independently of k (the value for k can be computed later), to guarantee a false-positive probability smaller than a certain value p:

\left(\frac{1}{2}\right)^{\ln(2) \cdot m/n} \leq p
Taking the base-2 logarithm of both sides

-\ln(2) \cdot \frac{m}{n} \leq \log_2(p)
and then solving for m finally gives us

m \geq \frac{n \cdot \log_2(1/p)}{\ln 2} \approx 1.44 \cdot n \cdot \log_2(1/p)
This means that if we know in advance the total number n of unique elements that we will insert in the Bloom filter, and we set the false-positive probability p to the largest value that is acceptable, then we can compute the optimal size of the buffer that guarantees the desired precision. We will also have to set k accordingly, using the formula we derived for it:

k = \ln(2) \cdot \frac{m}{n} = \log_2\!\left(\frac{1}{p}\right)
For instance, if we would like to have a precision of 90%, and so at most 10% false positives, we can plug in the numbers and get

m \geq \frac{n \cdot \log_2(10)}{\ln 2} \approx 4.8 \cdot n, \qquad k = \log_2(10) \approx 3.32

that is, roughly 5 bits of buffer per stored element and, after rounding, k = 3 hash functions.
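As a quick numeric double check of these figures, in Python:

```python
import math

p = 0.1                                         # at most 10% false positives
bits_per_key = math.log2(1 / p) / math.log(2)   # m*/n = log2(1/p) / ln 2
k = math.log2(1 / p)                            # k* = log2(1/p)
print(bits_per_key, k)  # -> about 4.79 bits per stored key and k = 3.32, rounded to 3
```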

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
