Bloom filters: Implementing a memory-efficient way to track contacts

Enough with the theory; now it's time to once again get our hands dirty! As usual, in the next sections we'll show pseudo-code snippets and comment on the key parts. Trivial methods will be omitted. On the book's repo on GitHub, you can also find an implementation with the full code, along with unit tests.

1. Using a Bloom filter

Back to our contacts application: How would we use a Bloom filter to make it faster? Well, as mentioned, we need to use it as a dictionary, so we are going to create a new Bloom filter when our email application starts, retrieve all the contacts from the server, and add them to the Bloom filter. Listing 4.1 summarizes this initialization process.
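Since the listings themselves aren't reproduced in this excerpt, here is a minimal Python sketch of that initialization. BloomFilter and retrieve_contacts_from_server are hypothetical names standing in for the book's actual API, not the book's own code:

    def init_contacts_filter(max_size, max_tolerance=0.01, seed=None):
        # Create an empty filter sized for the expected number of contacts
        contacts_filter = BloomFilter(max_size, max_tolerance, seed)
        # Warm it up with all the contacts currently stored on the server
        for contact in retrieve_contacts_from_server():
            contacts_filter.insert(contact)
        return contacts_filter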


Once we set up our directory application, we have two operations that we are mainly interested in: checking if a contact is on the list and adding a new contact to the directory.

For the former operation, shown in listing 4.2, we can check the Bloom filter, and if it says the contact has never been added, then we have our answer: the contact is not in the system. If, however, the Bloom filter returns true, then it might be a false positive, so we need to contact the server to double-check.
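A sketch of this check, under the same assumptions (check_server is a hypothetical call to the authoritative contacts list on the server):

    def check_contact(contacts_filter, contact):
        if contacts_filter.contains(contact):
            # Might be a false positive: double-check with the server
            return check_server(contact)
        # A negative answer from a Bloom filter is always reliable
        return False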

For adding new contacts, we always have to sync to our permanent storage,[2] as shown in listing 4.3. Since this likely implies a remote connection through a network, there is a non-negligible probability that the call to the server fails; therefore, we need to handle possible failures and make sure the remote call succeeds before also updating the Bloom filter.
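One possible shape for this operation, assuming a hypothetical store_contact_on_server helper that returns true only when the remote call succeeds:

    def add_contact(contacts_filter, contact):
        # Update the filter only after the remote call has succeeded, so the
        # filter never claims to contain a contact the server failed to store
        if store_contact_on_server(contact):
            contacts_filter.insert(contact)
            return True
        return False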

To be thorough, in real implementations we should also synchronize access to the server and the Bloom filter using a locking mechanism (see chapter 7), and wrap the whole operation in a try-catch, rolling back (or retrying) if the call on the Bloom filter fails.

2. Reading and writing bits

Now, let’s move to the implementation of a Bloom filter, starting, as usual, with the helper methods that will give us the basic blocks to build the implementation of our API.

In particular, we need

  • Some way to read and write bits at any location in our filter’s buffer
  • A mapping between a key in input and the bits’ indices in the buffer
  • A set of deterministically generated hash functions that will be used to transform keys into a list of indices

If we are using Bloom filters to save memory, it wouldn't make sense to store bits inefficiently. We will need to pack bits into the smallest integer type available in the programming language we choose; therefore, both reading and writing a bit forces us to map the index of the bit to be accessed to a pair of integers.

In modern programming languages, in fact, you can typically speed up these operations by using fixed-size numerical arrays of primitive types and vector algebra. The price to pay is that when we get a request to access the i-th bit in the filter, we need to extract two coordinates from index i: the array element that is storing the i-th bit, and the offset of the bit we need to extract with respect to that element.

Listing 4.4 shows what this means and how it can be computed.
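In Python, that computation boils down to an integer division and a remainder. In this sketch, BITS_PER_INT is set to 8 to match the byte-sized elements used in the worked example below; it's an assumption, not a fixed requirement:

    BITS_PER_INT = 8  # size, in bits, of the buffer's elements

    def find_bit_coordinates(index):
        # The array element holding the index-th bit...
        element = index // BITS_PER_INT
        # ...and the bit's offset (from the right) within that element
        bit = index % BITS_PER_INT
        return element, bit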

Once we have those two indices, we can easily read or write any bit; it just becomes a matter of bitwise arithmetic. For instance, listing 4.5 shows the readBit method taking care of the reading part.
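A sketch consistent with that description; note that it returns the masked value rather than a normalized 0/1, a choice discussed later in the section on checking a key:

    def read_bit(bits_array, index):
        element, bit = find_bit_coordinates(index)
        # A non-zero result means the bit is set; zero means it isn't
        return bits_array[element] & (1 << bit)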

Listing 4.6 shows the writing counterpart, method writeBit. You might be surprised to see that we don't pass the value to write, but since (this version of the) Bloom filter doesn't support deleting elements, we can only ever write a 1; we never write zeros.
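The writing counterpart could look like this sketch (no value argument, for the reason just explained):

    def write_bit(bits_array, index):
        element, bit = find_bit_coordinates(index)
        # OR-ing with the mask sets the bit, leaving all other bits untouched
        bits_array[element] = bits_array[element] | (1 << bit)
        return bits_array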

Let’s go through an example for readBit and writeBit. Suppose we have this buffer: B=[157, 25, 44, 204], with BITS_PER_INT=8.

We call readBit(B, 19); then we have: element==2, bit==3.

Therefore

  • bitsArray[element]              (evaluates to 44)
  • (1 << bit)                      (8)
  • bitsArray[element] & (1 << bit) (8)

And since this result is non-zero, we know the bit is set to 1.

If, instead, we call writeBit(B, 15), then we have: (element==1, bit==7).

Therefore

  • bitsArray[element]              (evaluates to 25)
  • (1 << bit)                      (128)
  • bitsArray[element] | (1 << bit) (153)

And the buffer will be updated to B=[157, 153, 44, 204].
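With the sketches above, we can replay the whole example and verify the numbers:

    B = [157, 25, 44, 204]
    assert read_bit(B, 19) == 8        # non-zero: bit 19 is set
    write_bit(B, 15)                   # sets bit 7 of element 1
    assert B == [157, 153, 44, 204]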

3. Finding where a key is stored

To generate all the indices for the bits used to store a key, we go through a two-step process, described in listing 4.7.

Keep in mind that our ultimate goal is to transform a string into k positions, between 0 and m – 1.

First, we use two hash functions on strings that are very different from each other: murmur hashing and fnv1 hashing. The chances that both of them produce the same result for a given string are beyond slim.

Then, for each of the k bits we have to store, we retrieve the corresponding hash function from our pool. For each position i between 0 and k-1 we have generated (on initialization) a double-hashing function h_i. The i-th bit will therefore be returned by h_i(hM, hF), where hM is the result of murmur hashing on the input key and hF is the result of fnv1 hashing.
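Putting the two steps together, key2Positions could be sketched like this; murmur_hash and fnv1_hash stand for any two independent string-hashing functions (the seed, discussed next, is used only by the seeded one):

    def key_to_positions(hash_functions, seed, key):
        # Step 1: two independent digests of the key
        h_m = murmur_hash(key, seed)   # hM in the text
        h_f = fnv1_hash(key)           # hF in the text
        # Step 2: combine them through each double-hashing function h_i
        return [h(h_m, h_f) for h in hash_functions]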

Although the highest level of randomization would be obtained with a random seed for each run, we need a way to force deterministic behavior, both for tests and to recreate a Bloom filter that can make sense of a given buffer, in case the filter is serialized or to support a quick restart after a failure. Therefore, we should also leave the option to pass a seed to the Bloom filter's constructor.

4. Generating hash functions

In listing 4.7 we described how, in the key2Positions method, we pass an array of hash functions and use it to transform a key into a list of indices: the positions in the filter's bits array where we store the key. Now it's time to see, in listing 4.8, how we initialize these k hash functions needed to map each key (already transformed into a string) into a set of k indices, pointing to the bits that will hold the information about a key (stored vs. not stored).

The set of functions will be created by using double hashing to combine the two arguments in k different ways. With respect to linear or quadratic hashing, double hashing increases the number of possible hash functions that we can obtain from O(k) to O(k²). This is still far from the ideal O(k!) guaranteed by uniform hashing, but in practice it is close enough to yield good performance (meaning a low collision rate).
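Here is a minimal sketch of that initialization using plain double hashing, h_i(x, y) = (x + i·y) mod m; the book's listing 4.8 may combine the two digests slightly differently:

    def init_hash_functions(num_hashes, num_bits):
        def make_hash(i):
            # h_i maps the two base digests to a bit index in [0, num_bits)
            return lambda h_m, h_f: (h_m + i * h_f) % num_bits
        return [make_hash(i) for i in range(num_hashes)]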

5. Constructor

Let’s move on to the public API, which will mirror the API for set that we defined in section 4.3.3. Let’s start from the constructor method.

As it happens most of the time, the task for our constructor is mostly boilerplate code to set up the internal state of the Bloom filter. In this case, however, there is also some non-trivial math to work out in order to compute the resources to allocate so that the container lives up to the accuracy required by the client.

Listing 4.9 describes the code for a possible constructor. Notice, in particular, at lines #5 and #8 how we compute respectively the number of bits and the number of hash functions needed to have a ratio of false positives within the tolerance specified by maxTolerance (and consequently, at line #9, the number of array elements needed to store the filter). Here we assume that we use an array whose elements are integers, and BITS_PER_INT is a system variable that gives us the size, in bits, of integers. Clearly, for those languages supporting multiple numerical types, we can also choose to have arrays of bytes, when available.

On creation, we need to provide only the maximum number of elements that the filter will be expected to contain. If at any time we realize that we have stored more than maxSize elements in the filter, the good news is that we won't run out of space, but the bad news is that we can no longer guarantee the expected precision.

Speaking of which, we can pass an optional second parameter to set the expected accuracy. By default, the threshold for the probability of a false positive (maxTolerance) is set to 1%, but we can aim for better accuracy by passing a smaller value, or settle for worse accuracy, in exchange for a smaller memory footprint, by passing a higher value.

The last optional parameter is needed, as explained in the previous section, to force a deterministic behavior for the filter. When omitted by the caller, a random value is generated for the seed.

After validating the arguments received (omitted in listing 4.9), we can start setting a few base fields. Then comes the trickiest part: given the number of elements and the expected precision, computing how large our buffer needs to be. We use the formula described in section 4.10, but we need to verify that the resulting buffer is still small enough to be safely held in memory.

Once we have the size of the buffer, we can compute the optimal number of hashes needed to keep the rate of false positives as low as possible.
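Putting the last few paragraphs together, here is a possible constructor, using the standard sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2 from section 4.10; the field names, rounding choices, and omission of argument validation are all assumptions of this sketch, which builds on the helpers defined earlier:

    import math
    import random

    class BloomFilter:
        def __init__(self, max_size, max_tolerance=0.01, seed=None):
            self.max_size = max_size
            # A random seed unless the caller forces a deterministic one
            self.seed = seed if seed is not None else random.randrange(2**32)
            self.size = 0  # number of unique keys stored so far
            # Number of bits needed to stay within maxTolerance (line #5)
            self.num_bits = math.ceil(
                -max_size * math.log(max_tolerance) / math.log(2) ** 2)
            # Optimal number of hash functions (line #8)
            self.num_hashes = round(self.num_bits / max_size * math.log(2))
            # Array elements needed to hold num_bits bits (line #9)
            num_elements = math.ceil(self.num_bits / BITS_PER_INT)
            self.bits_array = [0] * num_elements
            self.hash_functions = init_hash_functions(self.num_hashes,
                                                      self.num_bits)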

6. Checking a key

We can now start composing the helper methods presented so far to build the Bloom filter's API methods. One note of caution: we assume keys are going to be strings, but they can also be serializable objects. If that's the case, you will need a consistent serialization function that turns equivalent objects (for instance, two sets containing the same elements) into the same string; otherwise, no matter how good your implementation of the Bloom filter (or any other dictionary you might use) is, your application won't work properly.

NOTE Data massaging and preprocessing are often as important as, or more important than, the actual algorithms run on the data.

With all the helper functions we already defined, checking the Bloom filter for a key becomes a piece of cake. We just need to retrieve the positions where the bits for the key would be stored, and check that those bits are all 1. To review the whole process, you can take a look at figure 4.4, and find the pseudo-code for the method in listing 4.10.

You might have noticed that in contains we check that the value returned by readBit isn't 0, rather than checking that it is equal to 1. If our bit is stored in the i-th element of the array as its j-th bit (from the right), the bitwise extraction returns a value with a single 1 in position j; to compare it to 1, we would have to shift the result j bits to the right, or compare it with a mask made of a single 1 shifted j positions to the left. Checking for a non-zero value spares us that extra bitwise operation, trimming a little time off every call.
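Within the hypothetical BloomFilter class sketched earlier, the method could look like this:

    def contains(self, key, positions=None):
        if positions is None:
            positions = key_to_positions(self.hash_functions, self.seed, key)
        # read_bit returns the masked value, so we only test for non-zero
        return all(read_bit(self.bits_array, i) != 0 for i in positions)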

Also notice that the method takes an optional second parameter. The reason will become clear in the next section.

7. Storing a key

Storing a key is pretty similar to checking it, only we need a little extra effort to keep track of the number of elements added to the filter and to use write instead of read. Figure 4.5 shows a step-by-step example, putting together all the little pieces we have coded so far.

Note that in this implementation of insert, shown in listing 4.11, when we compute the size of the filter, we keep track of the number of unique elements added to the filter, rather than the total number of times the insert method is called.

This is because the precision of the filter is not altered by adding the same key twice, thrice, or any number of times. To all intents and purposes, it counts as if the key had been added once.

There is a twist, though. If we add a new unique key x for which all bits' indices clash with locations already set to 1, it will be treated as a duplicate key, and the data structure's size won't be incremented. Again, this makes perfect sense if you consider that in that situation, even before actually adding the new colliding key, a call to contains(x) would have returned a false positive anyway.

In listing 4.11 you can see the reason why, when we wrote method contains, we added the possibility to pass it an array with the pre-computed bit indices for the current key. Inside insert, we perform a read closely followed by a write. Computing all the indices for the bits of a key can be expensive, so to avoid repeating this operation twice in short succession, we need a way to pass its result to contains. At the same time, we don't want to add this option to our public API, since this is internal magic that clients don't need to know about. If your programming language supports polymorphism and private methods, restricting the optional parameter to a private version of contains (internally called by the public one) would be wise.
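A sketch of insert along those lines, again as a method of the hypothetical class above, reusing the pre-computed positions:

    def insert(self, key):
        positions = key_to_positions(self.hash_functions, self.seed, key)
        # Pass the positions along so the key isn't hashed a second time
        if not self.contains(key, positions):
            self.size += 1  # only unique keys count toward the filter's size
            for i in positions:
                write_bit(self.bits_array, i)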

Another possible approach to save this duplicated effort could be having writeBit check whether the bit being overwritten was already set to 1, and return a Boolean stating whether the bit's value has changed. Then insert could check whether at least one bit was flipped. This alternative implementation will be provided in our repo.[3] It's up to you to decide which one you consider cleaner.

Either way, striving to count the unique keys added to the filter is going to be costly. Will the overhead be justified? That depends: if you don't expect many duplicate keys to be added to your filter, then it's probably not worth it. But just know that you need it to have an accurate estimate of the current probability of getting a false positive.

8. Estimating accuracy

In fact, our last task is to provide a method to estimate the probability of a false positive based on the current status of the filter; that is, on the number of elements currently stored in the filter, compared to its maximum capacity.

As we will see in section 4.10, this probability is roughly

(1 - e^(-kn/m))^k

where n is the number of elements currently stored in the filter, m is the number of bits in its buffer, and k is the number of hash functions used.

Listing 4.12 briefly illustrates the pseudo-code that computes this estimate.
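In our running Python sketch, following the formula above, a one-liner is enough (names as in the constructor sketch):

    def false_positive_probability(self):
        # (1 - e^(-k*n/m))^k, with n = size, m = num_bits, k = num_hashes
        return (1 - math.exp(
            -self.num_hashes * self.size / self.num_bits)) ** self.num_hashes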

This concludes the current section about how to implement a Bloom filter. Before we delve into the theoretical part explaining the mathematical basis of this data structure, let's first review some of the many applications for Bloom filters.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
