Radix tries: Efficient string search

To overcome these issues with tries, a few alternatives have been developed: the ternary search trie (TST), which trades slightly worse running time for lower memory usage, and the radix trie, to name just a couple.

While TSTs improve the space requirements to store links, and free us from worrying about platform-specific implementations to optimize how we store edges, the number of nodes we need to create is still on the order of magnitude of the number of characters contained in the whole corpus stored: O(n*m) for n words of average length m.

In tries, most of the nodes don’t store keys and are just hops on a path between a key and the ones that extend it. Most of these hops are necessary, but when we store long words, they tend to produce long chains of internal nodes, each with just one child. As we saw in section 6.2.1, this is the main reason tries need too much space, sometimes more than BSTs.

Figure 6.11 shows an example of a trie. Nothing special, just a small, regular trie. We can see that intermediate nodes always have children (assuming we prune dangling branches after deleting keys); sometimes just one child, sometimes more.

When an intermediate node has more than one child, we have several branches that we can traverse from it. When, instead, there is just one child, those two nodes begin to resemble a linked list. For example, take the first three nodes from the root of figure 6.11: they encode the prefix “an”, and a search for any other string starting with ‘a’ but not followed by ‘n’ couldn’t get anywhere in the trie.

In fact, it turns out that an intermediate node is a branching point if it has more than one child: it means that the trie stores at least two keys sharing the common prefix corresponding to that node. If that’s the case, the node carries valuable information that we can’t compress in any way.

Key nodes also store information, regardless of the number of children they have. They tell us that the path to reach them composes a string that is stored in the trie.

If, however, an intermediate node stores no key and has only one child, then it carries no relevant information; it’s only a forced step in the path.

Radix tries (aka radix trees, aka Patricia trees) are based on the idea that we can somehow compress the paths that lead to this kind of node, called pass-through nodes.

How? Figure 6.12 gives a hint about the process to compress these paths. Every time a path has a pass-through node, we can squash the section of the path hinging on these nodes into a single edge, which will be labeled with the string made by concatenating the labels of the original edges.
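To make the squashing step concrete, here is a minimal sketch of path compression in Python, assuming a trie stored as nested dictionaries where the special key '$' marks the end of a stored word (this representation is an illustration for the example, not the book’s classes):

```python
def compress(trie):
    """Squash chains of pass-through nodes into single, string-labeled edges."""
    result = {}
    for label, child in trie.items():
        if label == '$':
            result['$'] = True   # key marker: the path so far spells a stored word
            continue
        # While the child is a pass-through node (one child, no key),
        # merge its only edge label into the current one.
        while len(child) == 1 and '$' not in child:
            ((next_label, child),) = child.items()
            label += next_label
        result[label] = compress(child)
    return result

trie = {'a': {'n': {'t': {'$': True}, 'd': {'$': True}}}}
compress(trie)   # {'an': {'t': {'$': True}, 'd': {'$': True}}}
```

The two pass-through hops for ‘a’ and ‘n’ collapse into a single edge labeled “an”, exactly as in figure 6.12.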

How much can we save with this change? Let’s look at the two trees in figure 6.12 to get an idea.

The original trie has 9 nodes and 8 edges, and with the assumptions made in section 6.2.1, with a 4-byte overhead per node, this means 9 * 4 + 8 * 9 = 108 bytes.

The compressed trie on the right has 6 nodes and 5 edges, but in this case each edge carries a string, not just a character. However, we can simplify the operation by accounting for edge references and string labels separately: we would still count 9 bytes per edge (because we would include the string terminator byte in the edge cost), and add the sum of the string lengths as a third term. The total number of bytes needed is given by 6 * 4 + 5 * 9 + 8 * 1 = 77 bytes.

In other words, for this simple trie, the compressed version requires roughly 30% less memory.
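The back-of-the-envelope estimate can be replayed in a couple of lines; the 4-byte node overhead and 9-byte edge cost are the assumptions carried over from section 6.2.1:

```python
def trie_bytes(nodes, edges):
    # 4 bytes of overhead per node, 9 bytes per char-labeled edge
    return nodes * 4 + edges * 9

def radix_bytes(nodes, edges, label_chars):
    # Same 9-byte edge cost (the terminator byte is folded into it),
    # plus 1 byte per label character, counted separately
    return nodes * 4 + edges * 9 + label_chars

original = trie_bytes(9, 8)          # 108 bytes
compressed = radix_bytes(6, 5, 8)    # 77 bytes
saving = 1 - compressed / original   # about 0.29
```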

1. Nodes and edges

All the operations that we have described for tries can be similarly implemented for radix trees, but instead of edges labeled by chars, we need to store and follow edges labeled by strings.

While at a high level the logic of the methods is almost the same as for tries, to check which branch we should traverse we can’t just check the next character in the key, because edges could be labeled with a substring that matches more than one character of our argument s.

One important property in these trees is that no two outgoing edges of the same node share a common prefix. This is crucial and allows us to store and check edges more efficiently.

A first solution is keeping edges in sorted order and using binary search to look for a link that starts with the next character c in the key. Because there can’t be two edges starting with c, if we find one, we can compare the rest of the characters in its label to the next characters in the string. Moreover, binary search allows us to find this edge in time logarithmic in the number of edges, and because there can’t be more than k = |Σ| edges per node (there can be at most one starting with each character in our alphabet Σ), we know that the worst-case running time for performing binary search and finding a candidate edge is O(log(k)).

Because k is a constant that doesn’t depend on either the number of keys stored in the trie or the length of the words searched/inserted/etc., we can consider O(log(k)) = O(1) as far as asymptotic analysis is concerned. Moreover, no extra space is required to store edges with this solution.
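A sketch of this lookup, assuming each node stores its outgoing edges as a list of (label, child) pairs kept in sorted order; the child values here are just placeholder strings:

```python
from bisect import bisect_left

def find_edge(edges, s):
    """Return the unique (label, child) edge whose label starts with s[0], or None."""
    labels = [label for label, _ in edges]   # in practice this list would be cached
    i = bisect_left(labels, s[0])
    # No two labels share a first character, so there is at most one candidate.
    if i < len(labels) and labels[i][0] == s[0]:
        return edges[i]
    return None

edges = sorted([("anthem", "n1"), ("bat", "n2"), ("cat", "n3")])
find_edge(edges, "antique")   # ("anthem", "n1")
find_edge(edges, "dog")       # None
```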

This solution is illustrated in figure 6.13, where we also show how binary search works to find the possible edge matching.

Notice that the match between the string searched and an edge’s label doesn’t have to be (and usually isn’t) full; we’ll see in a moment what this means for our algorithms and how to handle these situations.

Of course, using sorted arrays, as we discussed in chapter 4 and appendix C, means logarithmic search but linear (translated: slow!) insertion. Although the number of elements can be considered a constant in asymptotic analysis, from a practical point of view this implementation can significantly slow down the insertion of new keys in large tries.

The alternative solutions to implement this dictionary for edges are the usual: balanced search trees, which would guarantee logarithmic search and insertion, or hash tables. The latter is illustrated in figure 6.14. We can keep a dictionary whose keys are characters and whose values are the full string labels of the node’s edges, together with a reference to the children linked by the edge. This solution requires O(k) additional space for a node with k children, and in the worst case it will require O(|Σ|) extra space per node.

Despite requiring more space and a little bookkeeping on insertion and deletion to update the hash table, this solution allows amortized constant-time lookup when searching the path for a key.
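A sketch of this hash-table layout; the class and method names are made up for the example:

```python
class Node:
    def __init__(self, key_node=False):
        self.key_node = key_node
        self.children = {}   # first character -> (full edge label, child Node)

    def add_edge(self, label, child):
        # The radix-trie invariant guarantees at most one edge per first character
        self.children[label[0]] = (label, child)

    def find_edge(self, s):
        # Amortized O(1) lookup of the only edge that could match s
        return self.children.get(s[0])

root = Node()
root.add_edge("anthem", Node(key_node=True))
root.find_edge("antique")   # ("anthem", <Node>)
root.find_edge("dog")       # None
```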

Independently of the implementation, the first step will be comparing the first character of the input string to the first character of the edges’ labels.

Overall, we have four possible cases, illustrated in figure 6.15:

  1. The string sE labeling an edge perfectly matches a substring of s, the input string. This means s starts with sE, so we can break it up as s = sE + s’. In this case, we can traverse the edge to the child, and recurse on the remaining string s’.
  2. There is an edge starting with the first character in s, but sE is not a prefix of s; however, they must have a common prefix sP (at least one character long). The action here depends on the operation we are running. For search, it’s a failure because it means there isn’t a path to the searched key. For insert, as we’ll see, it means we have to decompress that edge, breaking down sE.
  3. The input string s is a prefix of the edge’s label sE. This is a special case of point #2 and can be handled similarly.
  4. Finally, if we don’t find a match for the first character, then we are sure we can’t traverse the trie any longer.
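The four cases above can be told apart with a small classifier built on the longest common prefix; the function and its numeric return convention are just an illustration, not the book’s API:

```python
def classify(s, label):
    """Return the match case (1-4) for the input string s against an edge
    label; label is None when no edge starts with s[0]."""
    if label is None:
        return 4                 # case 4: no edge to follow
    p = 0
    while p < min(len(s), len(label)) and s[p] == label[p]:
        p += 1
    if p == len(label):
        return 1                 # case 1: label is a prefix of s, traverse the edge
    if p == len(s):
        return 3                 # case 3: s is a proper prefix of label
    return 2                     # case 2: common prefix shorter than both strings

classify("antique", "ant")      # 1
classify("antique", "anthem")   # 2
classify("ant", "anthem")       # 3
classify("dog", None)           # 4
```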

Now that we have clarified the high-level structure of radix trie’s nodes, let’s delve into the algorithms. Keeping in mind the considerations we just discussed, their behavior will flow naturally from trie’s methods.

Listing 6.12 shows the pseudo-code for the RadixTrie and RTNode classes, used to model this new data structure. We also added a class to model edges, to make the code cleaner. Have you noticed a tiny but meaningful detail? We don’t need to define a fixed alphabet beforehand, as we do for tries!

You can also take a look at a full implementation on the book’s repo on GitHub.

Listing 6.12 Class RadixTrie

class RTEdge
  #type RTNode
  destination
  #type string
  label

class RTNode
  #type boolean
  keyNode
  #type HashMap<Char, RTEdge>
  children

  function RTNode(storesKey)
    children ← new HashMap()
    this.keyNode ← storesKey

class RadixTrie
  #type RTNode
  root

  function RadixTrie()
    root ← new RTNode(false)
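For comparison, here is a possible Python rendering of listing 6.12 (a sketch, not the book’s reference implementation):

```python
class RTEdge:
    def __init__(self, destination, label):
        self.destination = destination   # RTNode at the other end of the edge
        self.label = label               # string labeling the edge

class RTNode:
    def __init__(self, stores_key):
        self.key_node = stores_key       # True if the path to this node spells a key
        self.children = {}               # first character -> RTEdge

class RadixTrie:
    def __init__(self):
        self.root = RTNode(False)

# No alphabet is fixed beforehand: the children dict simply grows as needed.
rt = RadixTrie()
rt.root.children['a'] = RTEdge(RTNode(True), "anthem")
```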

2. Search

The search method, shown in listing 6.13, is almost identical to the trie’s counterpart; the only difference is the way we get the next edge to traverse. Because we are going to reuse this operation over and again for the other methods, we extract its logic into a utility method, shown in listing 6.14.

This method just looks for an edge with a common prefix with the target string s, if any. Remember that no two edges out of a node can share a common prefix, so there can be at most one starting with the same character as s.

It returns some useful information that the caller can use to decide the action to take: the longest common prefix between the searched string and the edge’s label, and the suffixes of these two strings (with respect to their common prefix).

At line #6 of listing 6.13, we use this information to distinguish between the four match cases illustrated in figure 6.15. The only positive case for search is the first one, so we need to check that there is an edge whose label is a prefix of s.

The implementation of the utility method is straightforward, assuming we have a way to extract the longest common prefix of two strings. This can be done by compar­ing the characters at the same indices in the two strings, one by one, until we find a mismatch.

We assume that this method is given, and it also returns the suffixes of the two strings, meaning two strings made of the remaining characters in each of the input strings, once stripped of their common prefix.
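Putting the pieces together, here is a sketch of search plus the common-prefix helper; nodes are represented as (is_key, children) tuples, an assumption of this example rather than the book’s classes:

```python
def longest_common_prefix(a, b):
    """Return (common prefix, suffix of a, suffix of b)."""
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    return a[:p], a[p:], b[p:]

def search(node, s):
    is_key, children = node
    if s == "":
        return is_key               # the whole key was consumed along the path
    edge = children.get(s[0])
    if edge is None:
        return False                # case 4: no edge to follow
    label, child = edge
    _, s_suffix, label_suffix = longest_common_prefix(s, label)
    if label_suffix != "":
        return False                # cases 2 and 3: dead end for exact search
    return search(child, s_suffix)  # case 1: traverse the edge and recurse

# A tree storing "ant" and "and": root --"an"--> node --"t"/"d"--> key nodes
leaf_t = (True, {})
leaf_d = (True, {})
mid = (False, {'t': ("t", leaf_t), 'd': ("d", leaf_d)})
root = (False, {'a': ("an", mid)})
search(root, "ant")   # True
search(root, "an")    # False: "an" is only a pass-through point
```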

Figure 6.16 shows an example of the search method on the radix tree resulting from compressing the trie in figure 6.3.

Figure 6.16 Unsuccessful search for the string “antiquity” in the radix tree corresponding to the trie in figure 6.3. The first two diagrams show the initial steps in the search; then we fast forward to the final step.

3. Insert

As mentioned, cases #2 and #3 are the most complicated to handle, especially for method insert. When we find a partial match between a key and an edge, we will need to break the edge’s label down, split the edge into two new edges, and add a new node in the middle, corresponding to the longest common prefix sP.

This is illustrated in figure 6.17. Once the common prefix has been found, we need to add a new node in order to split the edge partially matching the string to insert, and then we can add a new branch to this new node.

This node we add is called a bridge node, because it will be a bridge between the existing node corresponding to the common prefix of the two strings, and the paths leading to the final nodes for these strings. Bridge nodes are, obviously, bifurcation points, where the path from the root branches out.

To better understand this operation, it might help you to imagine that we decompress the edge to the child into a path, going back to the trie representation with one char per link. Then we traverse this path until we get to the end of the common prefix (to, say, a node B), we add a new branch as a child of B, and finally we compress again the two sub-paths on the two sides of B.

Listing 6.15 describes the pseudo-code for the insert method, and figures 6.18 and 6.19 show two examples of how this method works on a simplified tree.

The method follows the same high-level logic as the trie’s version we saw in listing 6.5. We traverse the tree as far as we can (following the longest path covering a prefix of the string to insert), and then add a new branch for the new key.

Traversing the tree becomes more complicated because at each node we need to distinguish between the four different possible results of the edge label matching, and this complexity is reflected in the length of the method. Moreover, when we bump into case #2 or #3, we need to break down an edge and add a bridge node. Adding a new branch, however, becomes easier, because we just need to add a new edge and a single node.
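A minimal sketch of this traversal with edge splitting (not the book’s listing 6.15; the class and helper names are invented for the example):

```python
class N:
    def __init__(self, key=False):
        self.key = key
        self.children = {}   # first character -> (edge label, child N)

def insert(node, s):
    if s == "":
        node.key = True                          # the whole key was consumed
        return
    edge = node.children.get(s[0])
    if edge is None:                             # case 4: add a brand-new branch
        node.children[s[0]] = (s, N(key=True))
        return
    label, child = edge
    p = 0
    while p < min(len(s), len(label)) and s[p] == label[p]:
        p += 1
    if p == len(label):                          # case 1: traverse the edge
        insert(child, s[p:])
        return
    # Cases 2 and 3: split the edge, adding a bridge node for the common prefix
    bridge = N()
    bridge.children[label[p]] = (label[p:], child)
    node.children[s[0]] = (label[:p], bridge)
    insert(bridge, s[p:])

root = N()
for word in ["anthem", "ant", "antique"]:
    insert(root, word)
```

Inserting “ant” after “anthem” splits the “anthem” edge at a bridge node for “ant”; inserting “antique” afterward just adds a new branch to that bridge.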

4. Remove

Like for search, the only changes to the remove method, with respect to tries, revolve around the extraction of the common prefix. This is not surprising, because deleting a key can be thought of as a successful search followed by a clean-up of the deleted node.

With remove, we don’t have to worry about splitting edges or adding bridge nodes. Because we need to find the key first, there must be a path perfectly matching the key to be deleted, in order to be able to complete the operation. We might, however, have the chance to compact the final part of the deleted path, because turning an existing key node into an intermediate node can change the tree structure, introducing a pass-through node (see section 6.3.1).

Figure 6.20 shows an example of remove in action on a radix tree.

Besides that, we might have to perform the usual pruning when deleting a key in a leaf. The difference with tries, in this case, is that we will only have to remove a single edge in radix trees.

The example in figure 6.20 shows both cases where we have to correct a node’s parent. We first remove the key from a leaf, and then, once the node is removed from the tree, its parent becomes a pass-through node, and hence can be removed, allowing us to compress the path from its parent to its only child.

Listing 6.16 shows the pseudo-code for this method, where we use two utility functions.

isPassThrough checks if a node is a pass-through node. This only happens when a node is not a key node, it has only one outgoing edge, and even its parent only has one outgoing edge (hence we need to pass the parent too). Implementation is left as an exercise.

Since a pass-through node only has one outgoing edge, its children field will have exactly one entry; getPassThroughEdge is a wrapper for retrieving this entry.
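Here is a sketch of the cleanup, with nodes again as (is_key, children) tuples. Note that the book’s isPassThrough also inspects the node’s parent; this simplified variant instead merges a whole chain starting from a given edge, which makes the parent-side check implicit:

```python
def is_pass_through(node):
    # Simplified check: no key stored and exactly one outgoing edge
    is_key, children = node
    return not is_key and len(children) == 1

def compress_edge(label, node):
    """Merge the chain of pass-through nodes below this edge into its label."""
    while is_pass_through(node):
        ((sub_label, node),) = node[1].values()   # the node's single outgoing edge
        label += sub_label
    return label, node

# After removing the key "ant", the node at the end of the "ant" edge stores
# no key and keeps a single child: the two edges collapse into one.
leaf = (True, {})
former_key_node = (False, {'h': ("hem", leaf)})
compress_edge("ant", former_key_node)   # ("anthem", leaf)
```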

5. Longest common prefix

Porting this method from the trie’s version is straightforward. It’s just a matter of slightly modifying the search algorithm to take into account the different way we do edge matching. Listing 6.17 describes the pseudo-code for the radix trie’s version.
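A sketch of this adaptation, assuming (as in the trie version) that the method returns the longest stored key that is a prefix of the input string; nodes are (is_key, children) tuples as in the earlier examples:

```python
def longest_prefix(node, s, matched="", best=None):
    is_key, children = node
    if is_key:
        best = matched                   # remember the last key node on the path
    edge = children.get(s[0]) if s else None
    if edge is None:
        return best
    label, child = edge
    if not s.startswith(label):          # partial edge match: the path ends here
        return best
    return longest_prefix(child, s[len(label):], matched + label, best)

# A tree storing "ant" and "anthem"
leaf = (True, {})
mid = (True, {'h': ("hem", leaf)})
root = (False, {'a': ("ant", mid)})
longest_prefix(root, "anthropology")   # "ant"
longest_prefix(root, "anthem")         # "anthem"
```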

6. Keys starting with a prefix

For tries, this method leverages search to find the starting point of a full-fledged traversal, retrieving all keys in the subtree rooted at the prefix.

Unfortunately for radix tries, the situation is a bit more complicated, because prefixes that are not stored in the tree can partially match edges. Take, for instance, the tree in figure 6.20, where prefixes such as “a” or “anth” are not stored in the tree. The latter doesn’t even have a node at the end of the corresponding path, but still the radix trie contains several words starting with those prefixes.

If we were just looking for nodes that lie at the end of the path for those strings, we would miss all those legitimate results. We need, instead, a special version of the search method for this operation, one that distinguishes edge match case #2, where we have a no-go, from case #3, where, since the string fragment searched is a proper prefix of the last edge’s label in the path, the subtree referenced by the edge will indeed contain strings that match the searched prefix. The difference between the two cases is illustrated in figure 6.21.

Listing 6.18 illustrates this new method, called searchNodeWithPrefix to distinguish it from an exact-match search. The API method keysStartingWith and the utility method allKeysInBranch are, instead, basically identical to the equivalent methods in tries, so we leave their pseudo-code as a useful exercise for the reader.
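A sketch of this logic, again over (is_key, children) tuple nodes; the key point is that in case #3 the unmatched tail of the edge label must be kept, since it belongs to every key in that subtree:

```python
def search_node_with_prefix(node, s):
    """Return (subtree root, pending label) for keys starting with s, or None."""
    if s == "":
        return node, ""
    edge = node[1].get(s[0])
    if edge is None:
        return None                      # case 4: nothing starts with s
    label, child = edge
    if label.startswith(s):              # case 3: s ends inside this edge
        return child, label[len(s):]
    if s.startswith(label):              # case 1: traverse and recurse
        return search_node_with_prefix(child, s[len(label):])
    return None                          # case 2: partial match, no results

def all_keys_in_branch(node, prefix):
    is_key, children = node
    if is_key:
        yield prefix
    for label, child in children.values():
        yield from all_keys_in_branch(child, prefix + label)

def keys_starting_with(root, s):
    found = search_node_with_prefix(root, s)
    if found is None:
        return []
    node, pending = found
    return list(all_keys_in_branch(node, s + pending))

# A tree storing "ant", "anthem" and "antique"
mid = (True, {'h': ("hem", (True, {})), 'i': ("ique", (True, {}))})
root = (False, {'a': ("ant", mid)})
keys_starting_with(root, "anth")   # ["anthem"]
```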

This concludes our rundown of the main methods for radix tries, and our discussion of data structures for efficient string search. Interested readers who would like to delve further into this subject may want to take a look at suffix trees and suffix arrays, two interesting data structures that are fundamental to fields like bioinformatics, and that are unfortunately out of scope for this chapter.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
