Use case: Memory is not enough (literally)

Let’s stop for a minute and recap what we have discussed so far:

  • We have a complex problem with many intermediate results that can frequently be reused.
  • To exploit this fact, we need a mechanism to remember the intermediate results so that we compute them only once (or at least as few times as possible).
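
In code, such a mechanism is essentially memoization: a lookup table keyed by the input, checked before doing any real work. Here is a minimal sketch in Python, where `analyze_sentiment` is a hypothetical stand-in for whatever expensive computation we want to avoid repeating:

```python
_cache = {}

def analyze_sentiment(text):
    # Hypothetical stand-in for an expensive NLP computation.
    words = text.split()
    return sum(w in ("good", "great") for w in words) \
         - sum(w in ("bad", "awful") for w in words)

def cached_sentiment(text):
    if text not in _cache:            # miss: compute once and remember
        _cache[text] = analyze_sentiment(text)
    return _cache[text]               # hit: just a dictionary lookup
```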

That seems straightforward, right? For our example, it probably is: there is a very small number of possible entries (at most 100), and for each one we only store the computed sentiment. A very small amount of memory should be enough for our cache.

What do you think would happen, however, if we needed to create a cache for the data gatherer in figure 7.2, storing all the messages across the major social networks mentioning a company in their raw format? Consider that, for the “coolest” companies, there could be millions of those, and each one would on average require a few kilobytes (assuming we store media as links or don’t store them at all). Even with just a hundred companies, it could add up to a few gigabytes.
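
A back-of-the-envelope calculation, with purely illustrative numbers, shows how quickly this adds up: a single popular company can already account for gigabytes.

```python
# Back-of-the-envelope estimate, with purely illustrative numbers.
messages = 2_000_000        # "millions" of mentions for a popular company
bytes_per_message = 2_048   # a few KB each, media stored as links

cache_gb = messages * bytes_per_message / 1e9
print(f"~{cache_gb:.1f} GB for a single popular company")  # ~4.1 GB
```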

Likewise, consider how an application like Facebook would work. When you try to access your wall, an algorithm computes the best posts to show you, based on your friends’ walls, your preferences, the pages you follow, and so on.

It’s a pretty complex and resource-consuming algorithm, so you want to cache its results as much as possible, since you probably don’t need to recompute the feed if the same user accesses their wall after one or five minutes (or maybe you have an incremental update mechanism that only surfaces new posts when they are published, but that’s another story and slightly off the point).

Now consider this caching mechanism applied to a billion walls: even if we store just the top 50 posts per wall, and for each post we store just its ID (typically a few bytes), we still need on the order of a terabyte for the cache.
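
Rough numbers (all assumed, just to check the order of magnitude):

```python
# Order-of-magnitude estimate for the walls cache, illustrative numbers.
walls = 1_000_000_000    # a billion walls
posts_per_wall = 50      # top 50 posts cached per wall
bytes_per_id = 20        # a few bytes per post ID

total_tb = walls * posts_per_wall * bytes_per_id / 1e12
print(f"~{total_tb:.0f} TB just for post IDs")   # ~1 TB
```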

These examples show how a cache can easily grow to a size that’s hard or impossible to keep in RAM. It would still be possible to use NoSQL databases or distributed caches, such as Memcached, Redis, or Cassandra, to name a few, but even there, the more entries we store, the slower checking the cache becomes. This will be clearer when we discuss the implementation of a typical cache.

We can also imagine a situation where we keep receiving different inputs, each needing a new entry in the cache. As you can imagine, we can’t add new entries indefinitely: since the cache is a finite system, an ever-growing number of entries will eventually saturate it.

Therefore, once the cache is full, adding new entries requires purging some of the existing ones.
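
In code, this means the cache needs a maximum size and an eviction step. A minimal sketch, assuming we evict an arbitrary entry because we haven’t chosen a policy yet (`BoundedCache` is just an illustrative name):

```python
# A bounded cache that must evict something when full.
# Which entry to evict is exactly the open question discussed next;
# here we just remove an arbitrary one.

class BoundedCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.data = {}

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.max_size:
            victim = next(iter(self.data))   # arbitrary choice, no policy yet
            del self.data[victim]
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)
```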

The question is, of course: which entries should we purge? The somewhat uncanny answer is that, ideally, we would remove the ones that will never be requested again, or at least those that will be requested least often.

Unfortunately, as much as we’ve gotten better at forecasting, and despite the amazing progress in AI, computers don’t have any psychic superpowers: we can’t predict exactly which elements we are going to need the most and which ones the least.

Spoiler alert: there are some assumptions we can use to make educated guesses. One reasonable assumption is that if an entry hasn’t been accessed for a long time, it has less value than an entry that was recently accessed. Depending on the size of the cache and on the time elapsed since the entry was last accessed, we can infer that the oldest entry is somehow stale, and it’s likely it won’t be needed soon, or ever again.
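
This assumption is the basis of the least recently used (LRU) policy. Here is a minimal sketch of the idea in Python, using `collections.OrderedDict` to keep entries ordered by recency; the real implementation is discussed in the following sections.

```python
from collections import OrderedDict

# Minimal LRU cache sketch: evict the entry accessed longest ago.
class LRUCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.data = OrderedDict()   # entry order = recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.max_size:
            self.data.popitem(last=False)  # evict least recently used
        self.data[key] = value
```

Because `get` refreshes an entry’s position, `popitem(last=False)` always removes the entry that was accessed longest ago.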

On the other hand, depending on the context, even the opposite could be true: entries that haven’t been accessed for a long time might be exactly the ones needed soon. In that case, though, removing the fresher items usually doesn’t work very well either, and some other criterion must be found.

If we only look at the timestamp of the last access, though, we discard other useful information. For example, if an entry was accessed several times, but not in the last few minutes, then all newer entries, even those accessed only once, will be considered more valuable. If you think about it, an entry that was only requested once has been computed but never read from the cache, so caching it has brought no advantage yet. After some time, if that entry still hasn’t been accessed, the utility of keeping it becomes questionable. A different approach, then, could be to assign a value to each entry based on how often it is accessed, as sketched below.
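
That is the idea behind the least frequently used (LFU) policy. A naive sketch that keeps an access counter per entry; finding the minimum this way is O(n), and real implementations use smarter structures, as we’ll see in the next sections.

```python
# Naive LFU sketch: evict the entry accessed least often.
class LFUCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.data = {}     # key -> cached value
        self.counts = {}   # key -> number of accesses

    def get(self, key):
        if key not in self.data:
            return None
        self.counts[key] += 1          # reward every read
        return self.data[key]

    def put(self, key, value):
        if key not in self.data:
            if len(self.data) >= self.max_size:
                # O(n) scan for the least-used entry; a real
                # implementation would use a priority queue.
                victim = min(self.counts, key=self.counts.get)
                del self.data[victim]
                del self.counts[victim]
            self.counts[key] = 0       # never read yet: lowest value
        self.data[key] = value
```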

We will discuss these recipes in detail over the next few sections, along with the implementation of these caches.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
