Bloom filters: Improved variants

Bloom filters have been around for almost 50 years, so it’s natural that many variations and improvements have been proposed over the years. Let’s examine some of them, with a focus on those that improve accuracy.

1. Bloomier filter

As we hinted at, Bloom filters correspond to faster, lighter versions of HashSets, since they can only record whether a key is present or absent.

The leaner counterpart of HashTables was introduced only more recently: Bloomier filters allow associating values with keys. When a key/value pair has actually been stored in the Bloomier filter, the value returned for that key is always correct. There are still false positives, though: keys that were never stored in the data structure, but for which a value is returned anyway.

2. Combining Bloom filters

By storing the same key in two or more different Bloom filters, possibly with different buffer sizes, but most importantly with a different set of hash functions for each one, we can arbitrarily lower the probability of false positives.

Of course, this is not free: the space needed grows proportionally to the number of filters, and in theory the time needed to store or check a key grows by the same factor.

At least for the running time, though, there is a silver lining: each of the component filters can be queried in parallel on multicore hardware! So, besides maintaining the O(k) constant-time bound, actual implementations could possibly be as fast as regular Bloom filters (in other words, the constant factor remains approximately the same).

The way this structure works is the following: for insert, each of the components stores the key independently. For calls to contains, the answer is the combination of all answers: true is returned only if all the components return true.
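Here is a minimal sketch of this ensemble in Python; the SimpleBloomFilter helper and the double-hashing scheme it uses are illustrative assumptions, not taken from the book (a different seed per component stands in for a different set of hash functions).

    import hashlib

    class SimpleBloomFilter:
        def __init__(self, m, k, seed=0):
            self.m, self.k, self.seed = m, k, seed
            self.bits = [False] * m

        def _positions(self, key):
            # Derive k positions from two salted digests (double hashing).
            digest = hashlib.sha256(f"{self.seed}:{key}".encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 odd
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def insert(self, key):
            for p in self._positions(key):
                self.bits[p] = True

        def contains(self, key):
            return all(self.bits[p] for p in self._positions(key))

    class BloomEnsemble:
        # insert stores the key in every component; contains answers true
        # only if all the components agree.
        def __init__(self, sizes, k):
            self.filters = [SimpleBloomFilter(m, k, seed=i)
                            for i, m in enumerate(sizes)]

        def insert(self, key):
            for f in self.filters:
                f.insert(key)

        def contains(self, key):
            return all(f.contains(key) for f in self.filters)

For example, BloomEnsemble(sizes=[1024, 2048], k=4) combines two filters with different buffer sizes; a key inserted through the ensemble can never be a false negative, while a false positive requires every component to err at once.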

What’s the accuracy of such an array of filters? It can be proven that the precision of a single Bloom filter using m bits is the same as the precision of j Bloom filters using m/j bits each.
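One back-of-the-envelope way to see this, assuming n stored keys and K hash functions split evenly across the j components (an assumption made here just for the sake of the derivation), uses the standard false-positive approximation:

  • A single filter with m bits and K hash functions: PF ≈ (1 - e^(-Kn/m))^K.
  • Each of the j filters, with m/j bits and K/j hash functions: (1 - e^(-(K/j)n/(m/j)))^(K/j) = (1 - e^(-Kn/m))^(K/j).
  • A combined false positive requires all j components to err at once: ((1 - e^(-Kn/m))^(K/j))^j = (1 - e^(-Kn/m))^K, exactly the single-filter value.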

However, when using a parallel version of the ensemble algorithm, the running time can be just a fraction of the original one, 1/j.

3. Layered Bloom filter

A layered Bloom filter (LBF) also uses multiple Bloom filters, but organized in layers. Each layer is updated for a key only after the previous layer has already stored the same key. Layered Bloom filters are normally used to implement a counting filter: an LBF with R levels can count up to R insertions of the same key. Often deletion is also supported.

Each call to contains starts by checking the layers, from the shallowest to the deepest, and returns the index of the last layer where the key was found, or -1 (or, equivalently, false) if the key is not found even in the first layer.

When storing a key, the insert method stores it in the first layer for which contains returns false.
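A minimal sketch of this layered logic, reusing the SimpleBloomFilter from the earlier sketch (counting only; deletion is omitted):

    class LayeredBloomFilter:
        def __init__(self, num_layers, m, k):
            # A different seed per layer stands in for a different hash set.
            self.layers = [SimpleBloomFilter(m, k, seed=i)
                           for i in range(num_layers)]

        def contains(self, key):
            # The number of consecutive layers holding the key estimates its
            # count; 0 plays the role of the book's -1/false.
            for i, layer in enumerate(self.layers):
                if not layer.contains(key):
                    return i
            return len(self.layers)

        def insert(self, key):
            # Store in the first layer that doesn't report the key yet.
            for layer in self.layers:
                if not layer.contains(key):
                    layer.insert(key)
                    return True
            return False  # all R layers report it: the counter is saturated

After inserting the same key twice, contains should return 2, barring false positives.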

Assuming that each layer has a false-positive ratio equal to PF, and that all layers use different sets of hash functions, then if an element has been stored c times in the filter:

  • The probability that contains returns c+1, a counter one unit larger than the real number of times the key was stored, is approximately PF.
  • The probability of returning c+2 is PF^2.
  • Similarly, for c+d, the probability becomes PF^d.

For instance, with PF = 1%, contains overshoots by two units with probability only 0.01%.

Those, however, are approximate (optimistic) estimates, because the assumptions about the universality and independence of the hash functions are hard to guarantee. To compute the exact probabilities, we would need to take into account the depth and the number of layers.

On an LBF with L layers, each using k bits per key, the running time for both insert and contains becomes O(L*k). But since both L and k are predetermined constants, it’s still equivalent to O(1) and independent of the number of items added to the container.

4. Compressed Bloom filter

When used in web caches, the main issue with Bloom filters is not the RAM they use; rather, since these filters need to be transferred between proxies, it’s the size of the data transmitted over the network.

While at first you might think that’s a moot point, it does have practical relevance if we plan to compress the Bloom filters before transmitting them through a network. It turns out we can tune the filter’s parameters, which in turn regulate the size of the bit array, and as a result get a larger uncompressed Bloom filter that can be compressed more efficiently.

That’s the idea behind compressed Bloom filters, where the number of hash functions k is chosen in such a way that the number of bits with value 1 in the bit array is kept below m/3, where m is the size of the array. As a consequence, at least 2/3 of the buffer’s bits will always be set to 0, and we can take advantage of this fact to compress the bit array more efficiently.
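A quick experiment illustrating why sparsity matters; zlib stands in here for the near-optimal encoder a real implementation would use, and all sizes and densities below are illustrative assumptions:

    import random
    import zlib

    random.seed(42)

    def random_bit_array(m, ones):
        # Build an m-bit array with `ones` bits set at random positions.
        bits = bytearray(m // 8)
        for pos in random.sample(range(m), ones):
            bits[pos // 8] |= 1 << (pos % 8)
        return bytes(bits)

    m = 80_000
    dense = random_bit_array(m, m // 2)    # classic tuning: ~half the bits set
    sparse = random_bit_array(m, m // 10)  # fewer hash functions: ~10% set

    for name, arr in (("dense", dense), ("sparse", sparse)):
        print(name, len(arr), "bytes ->", len(zlib.compress(arr, 9)), "compressed")

The half-full array is essentially incompressible (it carries close to one bit of entropy per bit), while the sparse one shrinks noticeably.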

Each proxy would then have to decompress the Bloom filter before looking up elements. Clearly, now we have another trade-off to optimize: we need to find a compromise between the uncompressed size of the bit array, m, and its compressed size, which we want to keep as small as possible.

The uncompressed size, in fact, determines the running time of lookups (through the formulas we developed in section 4.10), while the compressed size determines the cost of each transfer.

Since router tables will be updated periodically, decompressing the whole filter could be a heavy overhead for nodes. A good compromise is breaking the filter into pieces and compressing each piece independently. This makes the overall compression rate slightly worse, but when updates are frequent (compared to lookups), it reduces the decompression overhead even more, since each proxy, between two updates, won’t decompress the whole bit array, but only the chunks it needs.
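A sketch of that chunking idea, continuing the previous snippet (the chunk size is an arbitrary choice for illustration, not a value from the book):

    import zlib

    CHUNK = 1_024  # bytes per chunk

    def compress_chunks(bits: bytes):
        # Compress each fixed-size slice of the bit array independently.
        return [zlib.compress(bits[i:i + CHUNK])
                for i in range(0, len(bits), CHUNK)]

    def test_bit(chunks, pos):
        # Decompress only the single chunk that contains bit `pos`.
        byte_index, bit = divmod(pos, 8)
        chunk_id, offset = divmod(byte_index, CHUNK)
        return bool(zlib.decompress(chunks[chunk_id])[offset] >> bit & 1)

A lookup calls test_bit once per hash function, so between two updates a proxy only ever decompresses the handful of chunks its queries actually touch; caching those recently used chunks would cut the overhead further.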

5. Scalable Bloom filter

A scalable Bloom filter is yet another combination of several Bloom filters, working similarly to a layered Bloom filter. Only this time the different layers have increasing capacity (and a progressively tighter false-positive ratio). This allows the container to adapt dynamically to the number of elements stored and keeps the probability of false positives low.
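A minimal sketch, again reusing SimpleBloomFilter; the capacity, growth factor, and bits-per-key values are illustrative assumptions, and a full implementation would also tighten each new layer’s error rate so the total false-positive probability stays bounded:

    class ScalableBloomFilter:
        def __init__(self, capacity=1_000, growth=2, bits_per_key=10, k=4):
            self.capacity, self.growth, self.k = capacity, growth, k
            self.bits_per_key = bits_per_key
            self.filters = [SimpleBloomFilter(bits_per_key * capacity, k, seed=0)]
            self.count = 0  # keys inserted into the newest filter

        def insert(self, key):
            newest_capacity = self.capacity * self.growth ** (len(self.filters) - 1)
            if self.count >= newest_capacity:
                # The newest filter is full: add a larger one and fill that.
                cap = self.capacity * self.growth ** len(self.filters)
                self.filters.append(SimpleBloomFilter(
                    self.bits_per_key * cap, self.k, seed=len(self.filters)))
                self.count = 0
            self.filters[-1].insert(key)
            self.count += 1

        def contains(self, key):
            # Unlike the ensemble above, which ANDs its answers, a key may
            # live in any layer, so any single positive answer counts.
            return any(f.contains(key) for f in self.filters)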

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
