Now that we know two concrete data structures for implementing the ADT StringContainer, we can confidently look at applications where they make a difference.

As is usual with data structures, the difference is not about new things that couldn’t be done without tries, but rather about doing some operations better or faster than with other data structures.

This is particularly true for tries, as they were specifically designed to improve the running time of string-based queries. As one of the main uses for tries is to implement text-based dictionaries, the touchstone will often be hash tables.

### 1. Spell-checker

Time to go back to our main example! We saw in chapter 4 that Bloom filters were used for the first versions of spell-checkers, but after a while they were replaced with more efficient alternatives, such as tries.

The first step to build a spell-checker is, obviously, inserting all the keys from our dictionary (here meant as “English dictionary,” not the data structure!) in a trie.

Using the trie for spell check is then simple, at least when the feedback we want is just highlighting typos: we perform a search, and if it’s a miss, we have a typo.

But suppose, instead, that we would like to also provide suggestions about how we could correct the typo—how can we do that with a trie?

Let’s say that the word w we are checking has m characters, and we can accept suggestions differing by at most k characters from w: in other words, we want words whose *Levenshtein distance* (also known as edit distance) is at most k.

To find those words in a trie, we start traversing the tree from the root, and while traversing it we keep an array of m+1 elements, where the i-th element of this array (for i between 0 and m) is the smallest edit distance necessary to transform the key corresponding to the current node into the first i characters of our search string.

For each node N, we check the array holding the edit distances:

- If all distances in the array are greater than our maximum tolerance, then we can stop; there’s no need to traverse the node’s subtree any further (because the distances can only grow).
- Otherwise, we keep track of the last edit distance (the one for the whole search string), and if it’s the best we have found so far, we pair it with the current node’s key and store it.

When we finish traversing, we will have saved the closest key to our search string and its distance.
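The search just described can be sketched in a few lines of Python. This is an illustrative sketch, not the book’s implementation: the trie here is a plain nested dict with a '$' marker for key nodes, and the names `build_trie` and `trie_fuzzy_search` are made up for the example.

```python
def build_trie(words):
    """Build a toy trie as nested dicts; '$' marks a key node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def trie_fuzzy_search(trie, word, max_dist):
    """Return (distance, key) for the key closest to `word` within
    edit distance max_dist, or None if no key is close enough."""
    m = len(word)
    best = (max_dist + 1, None)
    # Row for the root: distance from "" to each prefix of `word`
    row0 = list(range(m + 1))

    def walk(node, key, prev_row):
        nonlocal best
        # The last cell holds the distance between `key` and `word`
        if '$' in node and prev_row[m] < best[0]:
            best = (prev_row[m], key)
        if min(prev_row) > best[0]:  # prune: distances can only grow
            return
        for ch, child in node.items():
            if ch == '$':
                continue
            row = [prev_row[0] + 1]  # empty-prefix column: path length
            for i in range(1, m + 1):
                cost = 0 if word[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,           # insertion
                               prev_row[i] + 1,          # deletion
                               prev_row[i - 1] + cost))  # substitution
            walk(child, key + ch, row)

    walk(trie, "", row0)
    return best if best[1] is not None else None
```

For example, on a trie holding “and”, “ant”, and “cod”, searching “amt” with tolerance 2 returns the pair (1, “ant”).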

Figure 6.22 shows how the algorithm works on a simplified example. It uses a trie, but the same algorithm, with minor changes, can easily be shown and implemented on radix trees. In fact, for this algorithm we are mostly interested in key nodes, not intermediate ones.

The algorithm starts at the root that corresponds to the empty string (because the path that leads to it is also empty). At each step, we have to compare the target word s (“amt”, in the example) to the word corresponding to the current node; in particular, we compute the distance between each prefix of s and the word associated to the current node.

So, for the root, the distance between the empty string and the empty prefix of “amt” (which also is the empty string, obviously) is 0 (because they match). The distance between “” and “a” is 1, because we need to add one character to the former to make the latter, and so on.

After computing our vector of distances, we traverse any outgoing edge and repeat the process for the next nodes. In this case, there is only one, associated with the string “a”, the concatenation of the labels leading to it. We can build the next row in the table using only the previous row, comparing the last character (the one marking the last edge traversed) to each character in s (note that for the empty string column, the distance will always be the length of the path).

Therefore, the second row in our table will start with a 1, then have 0 in cell [1,1], because both strings start with an ‘a’. For the next character in s, there isn’t a corresponding char in the node’s key (because it’s shorter), so we need to add 1 to get its distance, and the same for the last character. In fact, as a double-check, if we consider the prefix “am”, the distance to “a” is 1, while for “amt” it is 2.

Notice that the cost to compare two strings is always contained in the bottom-right cell, so for “a” and “amt” this distance is 2.

The algorithm goes on traversing all branches until we reach a point where the cost can’t decrease any more (once the path is as long as the string s or longer, there is no point in going further down a branch after we find a key node), or until all the distances in the last row are larger than a user-defined threshold, the maximum meaningful distance. This is particularly useful when searching for long strings, which would otherwise cause most of the tree to be traversed (while words within a distance of 2-3 characters are the most likely matches anyway).

As you saw in figure 6.22, the smallest distance is obtained for the path “ant”; however, there is a catch! This trie, in fact, doesn’t contain “ant” as a key, and therefore we can’t take this value into consideration.

Instead, there are several keys at distance 2, and any of them, or all of them, can be returned as a suggestion.

How fast can we find a suggestion? As we know at this point, after the discussion in section 6.1, searching a string in a trie has a better worst-case running time than the alternatives: O(m) comparisons for a string of length m, while for hash tables or binary search trees it would be O(m + log(n)) at best.

### 2. String similarity

The similarity between two strings is a measure of the distance that separates them. Usually, it’s some function of the number of changes needed to transform one string into the other.

Two examples of these distances are

- The *Levenshtein distance*, the number of single-character edits needed to transform one string into the other
- The *Hamming distance*, the number of positions in which the strings are different
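Both distances are easy to compute directly; here is a minimal Python version of each (the classic dynamic-programming formulation for Levenshtein, and a simple positional comparison for Hamming):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            row.append(min(row[j - 1] + 1,       # insertion
                           prev[j] + 1,          # deletion
                           prev[j - 1] + cost))  # substitution
        prev = row
    return prev[-1]

def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))
```

For instance, `levenshtein("amt", "ant")` and `hamming("amt", "ant")` are both 1, while `levenshtein("kitten", "sitting")` is 3.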

As we have seen in the previous sub-section, string similarity is used by spell-checkers to decide the best suggestions to correct typos.

But recently another even more important use case has become popular: bioinformatics, matching sequences of DNA. This is a computationally intensive task, so using the wrong data structures can make it impossible to solve.

When we have to compare just two strings, directly computing the Levenshtein distance is the most effective way to go; however, if we have to compare a single string to n other strings to find the best match, computing the Levenshtein distance n times becomes impractical. The running time would be O(n * m * M), where m is the length of our search string and M is the average length of the n strings in the corpus.

Turns out, we can do much better using a trie. By using the same algorithm shown for spell-checkers (without the threshold-based pruning), computing all the distances will only take time O(m * N), where N is the total number of nodes in the trie, and while the trie construction could take up to O(n * M), it would only happen once at the beginning, and if the rate of lookups is high enough, its cost would be amortized.

In theory N can be O(n * M), as we saw in section 6.2, if no two strings in the corpus share the same prefix. In practice, however, it is likely that N is orders of magnitude smaller than n * M, and closer to O(M). Moreover, as we saw in the last sub-section, if we set a threshold for the max tolerance, that is, for the largest difference between two strings, and keep track of the best result we have found, we can prune the number of nodes we traverse during search even further.

### 3. String sorting

*Burstsort* is a cache-efficient sorting algorithm that works similarly to MSD (Most Significant Digit) radix sort, but makes better use of the cache and is even faster.

They both have the same asymptotic running time, O(n * M), which is a theoretical lower bound for sorting n strings of length M, but burstsort produces results about twice as fast by exploiting locality of reference and a better distribution of memory accesses.

Going into the details of this algorithm is out of the scope of this chapter, but to give you an idea of how burstsort works, it dynamically constructs a trie while the strings are sorted, and uses it to partition them by assigning each string to a bucket (similar to radix sort). The asymptotic cost, as mentioned, is the same as MSD’s, because leading characters of each string are inspected once only. The pattern of memory accesses, however, makes better use of cache.

While MSD, prior to the bucket-sorting phase, accesses each string once for each character, burstsort accesses each string only once overall. The trie nodes, instead, are accessed randomly.

However, the set of trie nodes is much smaller than the set of strings, so cache is used more wisely.
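To make the idea more concrete, here is a deliberately simplified sketch in Python. Real burstsort implementations use carefully tuned, cache-conscious bucket representations; the nested-dict trie and the `bucket_limit` parameter here are assumptions made for the example.

```python
def burstsort(strings, bucket_limit=4):
    """Sort strings by growing a trie of buckets, 'bursting' any
    bucket that overflows into a new trie node."""
    EOS = ""  # marker for strings exhausted at the current node
    root = {}

    def insert(node, s, depth):
        if depth == len(s):
            node.setdefault(EOS, []).append(s)
            return
        c = s[depth]
        child = node.get(c)
        if child is None:
            node[c] = [s]                  # start a fresh bucket
        elif isinstance(child, list):
            child.append(s)
            if len(child) > bucket_limit:  # burst: replace the bucket
                node[c] = {}               # with a trie node and
                for t in child:            # redistribute its strings
                    insert(node[c], t, depth + 1)
        else:
            insert(child, s, depth + 1)

    for s in strings:
        insert(root, s, 0)

    out = []

    def traverse(node, depth):
        # Strings that end here sort before any longer extension
        out.extend(node.get(EOS, ()))
        for c in sorted(k for k in node if k != EOS):
            child = node[c]
            if isinstance(child, list):
                # Small bucket: finish it off with a comparison sort
                out.extend(sorted(child, key=lambda t: t[depth + 1:]))
            else:
                traverse(child, depth + 1)

    traverse(root, 0)
    return out
```

Each string is touched once during insertion; only the (much smaller) set of trie nodes is accessed repeatedly, which is what makes the real algorithm cache-friendly.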

If the set of strings exceeds the cache size, burstsort becomes considerably faster than other string sorting algorithms.

### 4. T9

T9 was such a big milestone in mobile history that we still (mistakenly) refer to new mobiles’ spell-checkers as T9, even though it was abandoned long ago with the advent of smartphones.

The name comes as an abbreviation of “Text on 9 keys,” as the alphabet was (long before mobile phones) divided into groups of three to four characters that would fit into a digital phone numpad.

In the original design for landline phones, every number had to be pressed one to four times to choose every single letter. For instance, 2 had to be pressed once for ‘a’, twice for ‘b’, and thrice for ‘c’.

Instead, the idea with T9 was that the user would press each key once for each letter in the word, to state that the i-th letter belonged to the group of the k-th button. Then T9 would offer suggestions for possible words made out of those combinations of letters, or even directly provide the right word, if only one possible match was found.

For instance, typing 2-6-3 would select all the three-letter combinations in the Cartesian product [a,b,c]x[m,n,o]x[d,e,f], and T9 would provide valid English words such as [and, cod, con,…].

This was made possible by keeping a trie, and for each key pressed refining the search:

- When keypad button 2 is pressed, we would start traversing the trie, going, in parallel, to the subtrees linked by edges marked with ‘a’, ‘b’, ‘c’ (all three of them would likely be in the trie for any language using the Latin alphabet).
- When the second keypad button is pressed, for each of the three paths that we are currently traversing, T9 checks to see if they have children labeled with ‘m’ , ‘n’ and ‘o’, and keeps track of the nodes reached at this second level. Each combination represents a path from the root to a level-2 node. Most likely, not all of the 9 combinations would have a path in the trie: for example, it’s unlikely any word will start with “bn”.
- The process continues with the next buttons pressed, until there is no node reachable through the possible paths traversed.

For this specific task, the trie nodes would likely store more than just a Boolean for each key. They would rather store the corpus frequency of that word (for instance, how likely it is that the word is used in English). This way, the most likely result would be returned when more than one result is available: for instance, you would expect that *and* would be preferred over *cod*.
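A minimal sketch of this traversal in Python (without the frequency ranking) might look like the following; the nested-dict trie and the function names are, again, assumptions made for the example:

```python
T9_KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
           '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def build_trie(words):
    """Build a toy trie as nested dicts; '$' marks a key node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def t9_candidates(trie, digits):
    """Return all dictionary words matching a key-press sequence."""
    frontier = [(trie, "")]  # trie nodes reachable with the digits so far
    for d in digits:
        next_frontier = []
        for node, prefix in frontier:
            for ch in T9_KEYS[d]:  # try every letter on this key
                if ch in node:
                    next_frontier.append((node[ch], prefix + ch))
        frontier = next_frontier
        if not frontier:  # no path in the trie matches the sequence
            return []
    return [prefix for node, prefix in frontier if '$' in node]
```

With a trie over [and, ant, cod, con], the sequence “263” yields [and, cod], while “266” yields only [con]. Ranking by corpus frequency, as described above, would just mean sorting this result list by a frequency value stored in each key node.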

### 5. Autocomplete

In the last 10 years or so, we have all become familiar with the autocomplete feature of search boxes. Today we even expect search boxes to provide it by default.

The usual autocomplete workflow is the following: A user on the client side (typically a browser) starts typing in a few letters, and the autocomplete search box shows a few options that start with the characters already typed (Does this ring a bell? Sounds like “all keys with prefix”?). If the set of possible values that could be inserted in the search box was static and small, then it could be transmitted to the client together with the page, cached, and used directly on the client.

This, however, is usually not the case, since the sets of possible values are normally large, and they might change over time, or even be dynamically queried.

So, in real-world applications the client usually sends (asynchronously) a REST request to the server with the characters typed so far.

The application server keeps a trie (or, more likely, a Patricia tree) with the valid entries, searches for the node corresponding to the characters typed so far (assuming they form a valid prefix), and returns a certain number of entries from that node’s subtree.

When the response comes back to the client, it simply shows the list of results from the server as suggestions.
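Server-side, the core of this feature is exactly the “all keys with prefix” operation mentioned earlier. A minimal Python sketch (hypothetical names, with a nested-dict trie standing in for the real trie or Patricia tree) could be:

```python
def build_trie(words):
    """Build a toy trie as nested dicts; '$' marks a key node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def autocomplete(trie, prefix, limit=5):
    """Return up to `limit` keys starting with `prefix`, in order."""
    node = trie
    for ch in prefix:          # walk down to the node for `prefix`
        if ch not in node:
            return []          # not a valid prefix: no suggestions
        node = node[ch]
    results = []

    def collect(node, key):
        if len(results) >= limit:
            return
        if '$' in node:        # a key node: `key` is a valid entry
            results.append(key)
        for ch in sorted(k for k in node if k != '$'):
            collect(node[ch], key + ch)

    collect(node, prefix)
    return results
```

A trie over [and, ant, anti, car] would answer the prefix “an” with [and, ant, anti]; a real service would cap and rank this list before sending it back to the client.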

Source: Marcello La Rocca (2021), *Advanced Algorithms and Data Structures*, Manning Publications.