Bloom filters: Applications of Reducing the memory for tracking content

I’d like to let you think about this consideration: it is pretty likely that some software you are using right now is leveraging Bloom filters. In fact, if you are reading the digi­tal version of this book, then it’s 100% sure that its download leveraged Bloom filters, because it’s common for internet nodes to use them for their routing tables.

1. Cache

With caching we refer to the process of storing some information retrieved on a fast storage system A, in order to have it ready in case we need to read it again in the near future. The data might have been previously retrieved from a slow(er) storage system B or be the result of a CPU-intensive computation.

For web applications, scalability is a major concern, probably the one aspect that is most debated in design reviews for any product that aspires to become viral.

One of my favorite books tackling design and scalability, Scalability Rules (by Martin L. Abbot and Michael T. Fisher, Addison-Wesley Professional, 2016), has a chapter titled “Use Caching Aggressively,” where it is shown how caching at all levels of a web app is fundamental to allowing its scaling.

Cache is often the only thing saving databases from literally catching fire—or at least crashing. But even your laptop has several levels of caching, from fast L1-cache inside its CPU to in-memory cache used to process big files. Your very operating sys­tem caches memory pages in and out of RAM, swapping them to disk, in order to have a larger virtual address space and give you the impression that it has more memory available than is actually installed on your machine.

In other words, caches are one of the foundations of modern IT systems. Of course, since fast storage is costlier, there is only a limited amount of it, and the major­ity of the data will have to be left out of the cache.

The algorithm used to decide what data stays in the cache determines the behavior of the cache and the rate of cache hit (when the data searched for is already in the cache) and miss. Some of the most used algorithms are LRU (Least Recently Used), MRU (Most Recently Used), and LFU (Least Frequently Used). Chapter 7 gets into the details of these algorithms, but for now it is enough to note that they, as well as many other cache replacement policies, all suffer from “one-hit wonders.” In other words, they struggle with objects, memory locations, or web pages requested just once, and never again (in the average lifetime of the cache). This is particularly common for routers and Content Delivery Networks (CDNs), where an average of 75% of the requests for a node are one-hit wonders.

Using a dictionary to keep track of requests allows us to only store an object in cache when it’s requested for the second time, filtering out one-hit wonders and improving the cache hit ratio. Bloom filters allow performing such lookups using amortized constant time operations and with limited space, at the cost of accepting some painless false positives. For this application, the only result of false positives, however, would be a tiny reduction of the cache performance gain we get by using the Bloom filter in the first place (so, no harm done).

2. Routers

Modern routers have limited space and, with the volume of packets they process per second, they need extremely fast algorithms. They are thus the perfect recipient for Bloom filters, for all those operations that can cope with a small rate of errors.

Besides caching, routers often employ Bloom filters to keep track of forbidden IPs and to maintain statistics that will be used to reveal DoS (Denial of Service) attacks.

3. Crawler

Crawlers are automated software agents scanning a network (or even the entire web) and looking for content, parsing, and indexing anything they find.

When a crawler finds links in a page or document, it is usually programmed to fol­low them and recursively crawl the link’s destination. There are some exceptions: for instance, most file types will be ignored by crawlers, as will links created using <a> tags with an attribute rel=”nofollow”.

TIP It is actually recommended that you mark in this way any anchor with a link to an action having side effects. Otherwise, search engines’ crawlers, even if they respect this policy, will cause unpredictable behavior.

What can happen is that if you write your own crawler and you are not careful, it might end up in an endless loop between two or more pages with mutual links (or chain of links) to each other.

To avoid such loops, crawlers need to keep track of pages they already visited. Bloom filters, again, are the best way to do so, because they can store URLs in a com­pact way and perform checking and saving of the URLs in constant time.

The price you pay here for false positives is a bit higher than for the previous examples, because the immediate result will be that the crawler will never visit a URL that caused a false positive.

To overcome this issue, it is possible to keep a complete list of the URLs visited in a proper dictionary (or another kind of collection) stored on disk, and if and only if the Bloom filter returns true, double-check the answer in the dictionary. This approach doesn’t allow any space saving, but it provides some savings on the execution time if there is a high percentage of one-hit wonders among the URLs.

4. IO fetcher

Another area where Bloom filter-based caching helps is reducing the unnecessary fetching/storage of expensive IO resources. The mechanism is the same as with crawl­ing: the operation is only performed when we have a “miss,” while “hits” usually trig­ger a more in-depth comparison (for instance, on a hit, retrieving from disk just the first few lines or the first block of a document, and comparing them).

5. Spell checker

Simpler versions of spell checkers used to employ Bloom filters as dictionaries. For every word of the text examined, a lookup on a Bloom filter would validate the word as correct or mark it as a spelling error. Of course, the false positive occurrences would cause some spelling error to go undetected, but the odds of this happening could be controlled in advance. Today, however, spell checkers mostly take advantage of tries: these data structures provide good performance on text searches without the false positives.

6. Distributed databases and file systems

Cassandra uses Bloom filters for index scans to determine whether an SSTable has data for a particular row.

Likewise, Apache HBase uses Bloom filters as an efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. This in turn boosts the overall read speed, by filtering out unnecessary disk reads of HFile blocks that don’t contain a particular row or row-column.

We are at the end of our excursus on practical ways to use Bloom filters. It’s worth mentioning that other applications of Bloom filters include rate limiters, blacklists, synchronization speedup, and estimating the size ofjoins in DBs.

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)

Leave a Reply

Your email address will not be published. Required fields are marked *