The dictionary problem and Alternatives to implementing a dictionary

1. The dictionary problem: Keeping track of things

Let’s go over a hypothetical scenario: you work for a company that is large enough to maintain its own email service. This is a legacy service, providing basic features only. After the last reorganization, the new CTO[1] decides that you need to reinvent the email service to be aggressive on the market segment, and she puts your new manager in charge of the product redesign.

They want a brand-new, modern client with features like a contacts list and cool tricks. For instance, when you add a new recipient to an email, your application should check whether it’s already in the contacts list, and if it’s not, a popup (like the one shown in figure 4.1) should appear asking you if you’d like to add the new recipient.

And, needless to say, you are the lucky one who gets to be responsible for imple­menting this feature.

Of course, the resources allocated for the project are scarce, so you are only going to do a refactoring of the client, while on the server side you’ll have to make do with legacy code and legacy services running on proprietary machines.

Calling the database every time you need to check an email address against your con­tacts list is out of the question: you are stuck with this legacy machine that can’t scale up, and you didn’t get funds for refactoring and scaling it out.[2] Your DB could not support more than a few calls per second, and your management’s projection is in the hundreds of emails written per second (they are very optimistic, and a bit reckless, but let’s not worry about that for now!). Your first instinct would be to ask for a remote dis­tributed cache, like Memcached, Cassandra, or Redis. A roundtrip to the cache server alone would take in the range of a hundred milliseconds, best case scenario, which would be good. But neither spinning up a new server for the cache, nor buying it as a cloud service is feasible for you, budget-wise.

In the end, you decide you only have one way to solve this: asynchronously get your contacts list when you log in (or, more lazily, the first time you click on Compose during the current browser session), save the contacts list in your web page’s session storage space, and check the local copy of that data every time you look for existing contacts.

So far for the application design, figure 4.2 shows a possible architecture for our minimum viable product. Now you just need an efficient way to browse a contacts list and check if an email belongs to it.

Looking through a list for a certain entry is a common problem in computer sci­ence, known as the dictionary problem.

Figure 4.2 A possible architecture for the “save new contact (with suggestion)” feature. On the client side, the web app receives the contacts list from the web server, and with those data creates a dictionary on the session storage. Whenever the user adds a recipient on an email, the web app checks the dictionary. If the contact is not in the list, a popup is shown to the user, who can then decide to save it. In that scenario, to save the new contact another HTTP call is made to the web server, and at the same time the dictionary is updated with the new value (although without going through the server).

2. Alternatives to implementing a dictionary

The name shouldn’t be surprising; it’s exactly like when you need to look up a word in a dictionary (and by dictionary I mean one of those old gigantic paper books that have been almost entirely replaced by online dictionaries and search engines) or even in a phone book (which had no better luck after the third computer revolution).

To recap, our contacts web app needs to

  • Download the list of contacts from a server
  • Create a local copy for fast lookup/storage
  • Allow looking up for a contact
  • Provide the option to add a new contact if lookup is unsuccessful
  • Sync with the server when a new contact is added (or an existing one modified)

What we really need is a data structure that is specialized in these kinds of operations; we need it to support fast insertion, and at the same time it needs to provide a way to look up an entry by value.

To be clear, when we use a plain array, we don’t have an efficient array’s method that tells us the index of an element X, nor an efficient (as in sublinear) method to tell us if an element is in the array or not. The only way we have to tell if an element is in the array is by going through all the array’s elements, although in a sorted array we could use binary search to speed up the search.

For example, we could store the strings [“the”, “lazy”, “fox”] in an array, and to search for “lazy”, we would skim through the whole array, element by element.

An associative array, instead, by definition has a native method that efficiently accesses the stored entries with a lookup by value. Usually this structure allows storing (key, value) pairs. For instance, we would have a list like < (“the”, article) , (“lazy” , adjective) ,   (“fox”, noun)>. We could search for “lazy” and the associative array would return adjective.

Another difference with regular arrays would be that the order of insertion in an associative array doesn’t matter; it’s not even well defined. That’s the price you pay to speed up lookup by value.

To really understand how efficiently we can solve this problem, we need to get into the details of implementations. Using the dictionary abstraction, however, allows us to discuss how to solve a problem (for instance, finding whether an email belongs to a list of contacts) without having to deal with the details of the data structure’s represen­tation, hence leaving us free to focus on the task itself.

3. Describing the data structure API: Associative array

An associative array (also referred to as dictionary, symbol table, or just map), is com­posed of a collection of (key, value) pairs, such that

  • Each possible key appears at most once in the collection.
  • Each value can be retrieved directly through the corresponding key.

The easiest way to grasp the essence of associative arrays is to think about regular arrays as a special case: the keys are just the set of indices between 0 and the size of the array minus 1, and we can always retrieve a value by providing its index, so that the (plain) array [“the”, “lazy”, “fox”] can be interpreted as a dictionary storing the associations (0, “the”), (1, “lazy”) and (2, “fox”).

Associative arrays generalize this concept, allowing keys to be drawn from virtually any possible domain.

With this API defined, we can sketch a solution for our initial problem.

When users log in to their email, our client receives a list of contacts from the server and stores them in a dictionary that we can keep in memory (having so many contacts that it won’t fit the browser’s session storage would be an exceptional situa­tion even for an Instagram rock star!). If the user adds a new contact to our address book, we perform a call to insert on the dictionary; likewise, if users remove an exist­ing contact, we just keep the dictionary in sync by calling remove. Whenever a user writes an email and inserts a recipient, we first check the dictionary, and only if the contact is not in the address book do we show a popup asking users if they would like to save the new contact.

This way, we never do an HTTP call to our server (and in turn to the DB) to check whether a contact is on our address book, and we only read from the DB once on startup (or the first time during a session that we write an email).

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)

Leave a Reply

Your email address will not be published. Required fields are marked *