Bloom filters: Concrete data structures

So far, so good with the theory; but of course, implementing associative arrays to be used on real systems is a completely different matter.

In theory, if the domain (the set of possible keys) is small enough, we can still use arrays by defining a total ordering on the keys and using their position in the order as the index for a real array. For instance, if our domain is made of the words {“a”, “terrible”, “choice”}, we could sort keys lexicographically, and then we would store the values in a plain string array; for instance, {“article”, “noun”, “adjective”}. If we need to represent a dictionary that only contains a value for the key “choice”, we could do that by setting the values corresponding to missing keys to null: {null, “noun”, null}.
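To make the idea concrete, here is a minimal Python sketch, using the same three-word domain and part-of-speech values as the example above:

```python
# A tiny dictionary over a fixed, sortable domain: each key's rank in the
# sorted order is its index in a plain array. Missing keys map to None.
DOMAIN = sorted(["a", "terrible", "choice"])  # ["a", "choice", "terrible"]
INDEX = {key: i for i, key in enumerate(DOMAIN)}

values = [None] * len(DOMAIN)

def insert(key, value):
    values[INDEX[key]] = value

def lookup(key):
    return values[INDEX[key]]

insert("choice", "noun")
print(lookup("choice"))    # noun
print(lookup("terrible"))  # None
```

This only works because the domain is fixed and tiny; with a large or open-ended domain, the `values` array would waste almost all of its slots, which is exactly the problem discussed next.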

Usually, however, this is not the case, and the set of possible values for the keys is large enough to make it impractical to use an array with an element for any possible key’s value; it would just require too much memory, which would remain largely unused.

To overcome this memory issue, we present two naive implementations and three of the most widely used alternatives.

1. Unsorted array: Fast insertion, slow search

Even if you have never seen one of those paper dinosaurs (a dictionary), you ought to at least be familiar with print books, like this one (unless, of course, you bought the e-book version!).

Suppose you need to look for a specific word in this book—maybe a name—like Bloom, and note all the places where it occurs. One option you have is to go through the book starting from the first page, word by word, until you encounter it. If you need to find all the occurrences of the word Bloom, you will have to go through the book from cover to cover.

A book taken as a collection of words, in the order they are printed, is like an unsorted array. Table 4.1 summarizes the performance of the main operations needed using this approach with unsorted arrays.

Unsorted arrays have the advantage of no extra work needed on creation, and adding a new entry is pretty easy, provided you have enough capacity.
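A sketch of this approach in Python (the keys and values here are hypothetical entries, just for illustration):

```python
# Unsorted-array dictionary: insertion is an O(1) append, while search is
# an O(n) scan over every entry (like reading a book cover to cover).
entries = []  # list of (key, value) pairs, in insertion order

def insert(key, value):
    entries.append((key, value))   # O(1), given enough capacity

def search(key):
    for k, v in entries:           # O(n) in the worst case
        if k == key:
            return v
    return None

insert("Bloom", "page 213")
insert("filter", "page 214")
print(search("Bloom"))    # page 213
print(search("hashing"))  # None
```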

2. Sorted arrays and binary search: Slow insertion, fast(-ish) search

That’s not exactly practical, as you can imagine. If, after going through the whole book, you need to search a second word, like “filter,” you would have to start all over and do a second pass through the hundreds of thousands of words in the book. That’s why most books have what’s called an index, usually toward the end of the book; there you can find an alphabetically ordered list of the most uncommon words (and names) used in the book. Common words don’t make it onto this list because they are used too frequently (articles like “the” and “a” are probably used on every single page) to be worth being listed, since the value of finding places where they are used would be minimal. Conversely, the rarer a word is in English (and names are the perfect example here), the greater importance it has when it’s used in your text.[1]

So, you can check the index and look for the name Bloom. Being in lexicographical order, you could go through it from the start until you catch the word you are looking for; it shouldn’t be too long with Bloom. You wouldn’t be that lucky with terms like hashing or, even worse, tree, which will be toward the end of the index.

That’s why, naturally, we do lookups in sorted lists by subconsciously using binary search:[2] with a phone book, you open it at a random page around the middle (or closer to the beginning or end, if you have an idea where the name you are looking for might be), then look at the first letter of the first word on that page, and jump either before or after the current page depending on what you are looking for. For example, if you are still looking for Bloom, and you open a phone book to a page where the first surname is Kurtz, then you know you can discard every page after that, and look only at the pages before. You randomly open another page (to the left of the one with Kurtz) and the last name on that page is Barrow; then you know Bloom will be on a page after the one with Barrow and before the one with Kurtz.

Going back to our problem with the contact list, one approach could be sorting your contacts and searching them using binary search.

As you can see from table 4.2, in terms of running time, both the initial cost (to sort the list) and the cost for adding a new entry are pretty high. Also, linear extra memory might be needed if we need to make a copy of the original list, to preserve the original order.
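The phone-book strategy can be sketched as a sorted-array dictionary, here using Python’s standard `bisect` module for the binary search (the contact names and e-mail addresses are made up for illustration):

```python
import bisect

# Sorted-array dictionary: keys are kept in sorted order, so search is an
# O(log n) binary search, but insertion is O(n) because it shifts entries.
keys, values = [], []

def insert(key, value):
    i = bisect.bisect_left(keys, key)  # find the insertion point: O(log n)
    keys.insert(i, key)                # shift the tail of the array: O(n)
    values.insert(i, value)

def search(key):
    i = bisect.bisect_left(keys, key)  # O(log n) comparisons
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

for name in ["Kurtz", "Barrow", "Bloom"]:
    insert(name, name.lower() + "@example.com")
print(keys)              # ['Barrow', 'Bloom', 'Kurtz']
print(search("Bloom"))   # bloom@example.com
```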

3. Hash table: Constant-time on average, unless you need ordering

We introduced hash tables and hashing in appendix C. The main takeaway for hash tables is that they are used to implement associative arrays where the possible values to store come from a very large set (for instance, all possible strings or all integers), but normally we only need to store a limited number of them. If that’s the case, then we use a hashing function to map the set of possible values (the domain, or source set) to a smaller set of m elements (the codomain, or target set), the indices of a plain array where we store the values associated with each key. (As we explained in appendix C, we get to decide how big m is depending on some considerations on the expected performance.) Typically, the set of values in the domain is referred to as keys, and the values in the codomain are indices from 0 to m-1.

Since the target set of a hashing function is typically smaller than the source set, there will be collisions: at least two values will be mapped to the same index. Hash tables, as we have seen in chapter 2, use a few strategies to resolve collisions, such as chaining or open addressing.
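A toy chaining-based hash table might look like the following sketch (the bucket count m=8 is deliberately tiny, just to make collisions likely; Python’s built-in `hash` stands in for the hashing function):

```python
# A toy hash table with chaining: m buckets, each holding a list of
# (key, value) pairs; keys that hash to the same bucket share its list.
M = 8  # number of buckets; kept small on purpose to force collisions

buckets = [[] for _ in range(M)]

def insert(key, value):
    b = buckets[hash(key) % M]
    for i, (k, _) in enumerate(b):
        if k == key:                # key already present: overwrite
            b[i] = (key, value)
            return
    b.append((key, value))          # chaining: append to the bucket's list

def search(key):
    for k, v in buckets[hash(key) % M]:
        if k == key:
            return v
    return None

insert("Bloom", "name")
insert("filter", "noun")
print(search("Bloom"))   # name
print(search("tree"))    # None
```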

The other important thing to keep in mind is that we distinguish between hash maps and hash sets. The former allows us to associate a value[3] with a key; the latter only to record the presence or absence of a key in a set. Hash sets implement a special case of dictionary, the Set. With respect to our definition of dictionary as an abstract data structure, given at the beginning of this section, a set is a specialization of a dictionary, where the value type is set to Boolean; the second parameter to insert becomes redundant, as the value associated with a key in the hash set will be implicitly assumed to be true.
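The set-as-dictionary specialization can be sketched in a few lines (here a plain Python `dict` stands in for the underlying hash table):

```python
# A hash set seen as a degenerate hash map: the stored value is implicitly
# True, so add() needs no second argument.
table = {}

def add(key):
    table[key] = True          # the Boolean value is redundant

def contains(key):
    return table.get(key, False)

add("Bloom")
print(contains("Bloom"))   # True
print(contains("filter"))  # False
```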

As we explain in appendix C, all operations in a hash table (and hash set) can be performed in amortized O(1) time.

4. Binary search tree: Every operation is logarithmic

Binary search trees (BSTs) are another old acquaintance of ours; we met them in chapters 2 and 3, and they are also discussed in appendix C.

A BST is a special kind of binary tree that can store keys on which a total ordering is defined: this means that for each pair of keys, it must be possible to compare them and decide which one is smaller, or if they are equal. A total ordering satisfies the reflexive, antisymmetric, and transitive properties.

Ordering relations

Given a set S on which we define an ordering relation ≤, this relation is a total ordering if, for any three keys x, y, z, the following properties hold:

Reflexive: x ≤ x

Antisymmetric: if x ≤ y and y ≤ x, then x = y

Transitive: if x ≤ y and y ≤ z, then x ≤ z

Moreover, the ordering must be total: for any two keys x and y, either x ≤ y or y ≤ x.

BSTs use these properties to make sure that the position of a key in the tree can be determined just by looking at a single path from the root to a leaf.

When we insert a new key, in fact, we compare it to the tree’s root. If it’s smaller, we take a left turn, traversing the root’s left subtree; otherwise, we move to the right subtree. At the next step, we repeat the comparison with the subtree’s root, and so on until we get to a leaf, and that’s exactly the position where we need to insert the key.

If you remember what we saw in chapter 2 regarding heaps (or if you had a chance to refresh your memory, in appendix C), all operations in a BST take time proportional to the height of the tree (the longest path from root to a leaf). In particular for balanced BSTs, all operations take O(ln(n)) time, where n is the number of keys added to the tree.

Of course, compared to the O(1) amortized running time of hash tables, even balanced BSTs don’t seem like a good choice for implementing an associative array. The catch is that while their performance on the core methods is slightly slower, BSTs allow a substantial improvement for methods like finding the predecessor and successor of a key, and finding minimum and maximum: they all run in O(ln(n)) asymptotic time for BSTs, while the same operations on a hash table all require O(n) time.

Moreover, BSTs can return all keys (or values) stored, sorted by key, in linear time, while for hash tables you need to sort the set of keys after retrieving it, so it takes O(M + n*ln(n)) comparisons.
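Both the insertion walk and the linear-time sorted traversal described above can be sketched as follows (a bare-bones, unbalanced BST, for illustration only):

```python
# BST insertion follows a single root-to-leaf path: smaller keys go left,
# others go right. An in-order traversal then returns all keys in sorted
# order in linear time.
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return Node(key)               # empty spot found: insert here
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def in_order(root):
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(in_order(root))  # [1, 3, 6, 8, 10]
```

Note that without rebalancing, inserting already-sorted keys degenerates this tree into a linked list; balanced variants (discussed in appendix C) keep the height logarithmic.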

Now that we have described the basic data structures most commonly used to implement dictionaries, it feels like a good time to recap what we have described so far. Table 4.3 gathers the running times of the main operations on the possible implementations of dictionaries we mentioned.

Looking at table 4.3, it’s evident that if we don’t have to worry about any operation that involves the order of the elements, or the order they were inserted, then the amortized time of hash tables is the best. If n~=M (that is, there are approximately as many buckets as elements), then a hash table can perform insertion, removal, and look-up in amortized constant time.

5. Bloom filter: As fast as hash tables, but saves memory (with a catch)

We haven’t officially met this data structure yet in the book, but there is a good chance you have already heard of Bloom filters. They are a data structure named after Burton Howard Bloom, who introduced them in 1970.

There are four notable differences between hash tables and Bloom filters:

  • Basic Bloom filters don’t store data; they just answer the question, is a datum in the set? In other words, they implement a hash set’s API, not a hash table’s.
  • Bloom filters require less memory in comparison to hash tables; this is the main reason for their use.
  • While a negative answer is 100% accurate, there might be false positives. We will explain this in detail in a few sections. For now, just keep in mind that sometimes a Bloom filter might answer that a value was added to it when it was not.
  • It is not possible to delete a value from a Bloom filter.[4]

There is a tradeoff between the accuracy of a Bloom filter and the memory it uses. The less memory, the more false positives it returns. Luckily, there is an exact formula that, given the number of values we need to store, can output the amount of memory needed to keep the rate of false positives within a certain threshold. We’ll go into the details for this formula in the advanced sections toward the end of the chapter.
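To make these properties concrete, here is a minimal Bloom filter sketch. The bit-array size m=64, the number of hash functions k=3, and the salted-SHA-256 construction are all illustrative choices, not recommendations from the text:

```python
import hashlib

# A minimal Bloom filter: m bits and k hash functions. add() sets k bits;
# contains() checks that all k bits are set. A False answer is always
# correct; a True answer may occasionally be a false positive.
M, K = 64, 3  # toy bit-array size and number of hashes

bits = [False] * M

def _positions(value):
    # Derive K indices by salting SHA-256 with a counter (one common trick
    # for simulating a family of independent hash functions).
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
        yield int(digest, 16) % M

def add(value):
    for p in _positions(value):
        bits[p] = True          # note: bits are never cleared, so no delete

def contains(value):
    return all(bits[p] for p in _positions(value))

add("Bloom")
print(contains("Bloom"))   # True
print(contains("Burton"))  # almost certainly False; a false positive is possible
```

Notice that the filter stores only `M` bits regardless of how long the added values are, which is where the memory savings over a hash table come from.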

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
