Spell-check: Efficient string search

But first, let’s introduce the problem that we will solve in this chapter: spell-checking.

Not that the problem really needs any introduction, right? We would like to have a piece of software that can take words as inputs and return true or false, respectively, whether the input is a valid English word[1] or not.

That’s a bit of a vague description that leaves the door open to many possible (and possibly inefficient) solutions; we need more context to clarify our requirements. Let’s assume that we are developing a new client for a social network on a client with low resources (for instance, on a mobile OS) and we need to add a live spell-checker that produces the classic red wavy underline below a misspelled word. Every time we type a word (and every time we type a word separator, like a space, commas, and so on) we need to check whether there is a typo.

Due to scarce resources, we need our spell-checker to be fast and lightweight; we need to reduce its impact as much as possible, both in terms of CPU and memory usage.

But we would also like our spell-checker to be able to learn—for instance, so users can add their names or the name of their town/club/favorite artists and so on.

1.  A prncess, a Damon, and an elf walkz into a bar

Have you noticed the typos in this section’s title? Typos are always around the corner when you need to send a message in a hurry (regardless of the medium used) and unfortunately, once these messages are sent out, there is no going back—you can’t edit them anymore.

That’s why a spell-checker that clearly highlights typos comes in handy, and today it’s often included natively in browsers.

In the end, the design of our spell-checker is pretty simple: it is a wrapper around a dictionary, and the client’s method that checks spelling just calls the container contains method, and if the result is a miss, adds the visual feedback to show the error.

In turn, the design of our container’s API is also simple. It’s a generic container supporting search, like the API of binary search trees or Randomized Treaps (described in section 3.4).

We can already think about how to implement this container with the tools that we learned. For instance, if we knew that we had to support fast lookups on a static set, then we would have chosen a hash table or, if we could trade saving some memory for a certain loss of precision, we could even resort to a Bloom filter. Since the require­ment is maintaining an open, dynamic set, however, the one data structure providing the best compromise for all the operations would be a tree.

A simple binary tree could, of course, be enough to support all the operations pro­vided by dictionaries. Figure 6.1 shows a possible representation of what these trees could look like, where we chose to show only a small part of the subtree containing a few similar words (we’ll see why this is relevant in a few lines).

How fast can operations on such a tree be? Assuming the tree is balanced, its height will be logarithmic in the number of words it contains, so for each call to contains, insert, remove, and so on, we’ll need to traverse on average (and at most) O(log(n)) nodes.

So far, however, when we’ve analyzed trees, we’ve assumed their keys were either integers or could be checked in constant time and require a constant space.

For generic strings, this assumption is not realistic anymore. Each node will need to store a string of unbounded length, so the total memory needed to store the tree will be the sum, for each node, of all the keys’ lengths:

If we assume that the average length of the strings held by the tree is L, the expected value for S(n), the space needed to store the tree, is proportional to n*L. If the maxi­mum length for strings is denoted with m, then S(n) = O(n*m) is a strict upper bound for the worst case.

Likewise, if we look at the running time of the search method, we see that we can’t ignore the strings’ lengths anymore. For instance, a call to search(“antic”) on the tree in figure 6.1 would start from the root and compare “antic” to “anthem”, which would require at least four characters to be compared before verifying the two words are not the same. Then we would move to the right branch and again compare the two strings, “antic” and “antique” (five more character-to-character comparisons), and since they don’t match, traverse the left subtree now, and so on.

Therefore, a call to search would require, worst case, T(n) = O(log (n) *m) com­parisons.

2. Compression is the key

This quick analysis shows that using a tree is not ideal, either space-wise or performance- wise. If we look more closely at the tree in figure 6.1, we can see that there seems to be a lot of overhead: all words start with the character “a” , but this is stored once for every node of the subtree in the illustration, and for each step in the traversal of the tree, it will be compared with the text that is being searched (or inserted).

Looking at the path to search for the word “antic”, all the four nodes shown and tra­versed share the same prefix, “ant”. Wouldn’t it be nice if we could somehow compress these nodes and only store the common prefix once with the deltas for each node?

3. Description and API

The data structure that we are going to introduce in the next section was created to answer these needs, and also to offer an efficient way to solve another operation: find all the keys in the container that start with the same prefix.

From our example in the previous sub-section, you can already see that if we were able to somehow store the common prefixes of strings only once, we should be able to quickly access all strings starting with those prefixes.

Table 6.1 shows the public API for an abstract data structure (ADT) that supports the usual container basic operations, plus two new ones: retrieving all the strings starting with a certain prefix and finding the longest prefix of a string stored in the container.

From what we saw in the previous example, PrefixTree could be a good name for this data structure, although StringContainer is more generic (as an abstract data structure, we don’t care if it’s implemented using a tree or some other concrete coun­terpart.) It conveys the gist of this container: being specific for strings. The fact that prefix search is supported is almost a natural consequence of designing a container for strings.

Now that we have fixed an API and described the ADT that we will use to solve our spell-check problem, we are ready to delve into more details and see a few concrete data structures that could implement this ADT.

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)

Leave a Reply

Your email address will not be published. Required fields are marked *