Disjoint sets: Heuristics to improve the running time

The next step in our quest for optimal performance is to make sure findPartition is logarithmic even in the worst-case scenario. Luckily, this is pretty easy! We discussed balanced trees in appendix C; feel free to check it out if you need a refresher.

Long story short, here we can easily keep track of the rank (that is, the size) of each tree, using linear extra space and performing constant-time extra operations in the merge method, where we update the rank only for the trees’ roots.

When we merge two trees, we will make sure to set as child the tree with the smallest number of nodes, as shown in figure 5.10.

It can be proven that this keeps the trees balanced: a node’s depth increases (by 1) only when its tree is set as the child of a tree that is at least as large, so the size of the tree containing that node at least doubles each time this happens. It follows that the depth of any node, and hence the height of any tree, can grow at most a logarithmic number of times.
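To make the balancing step concrete, here is a minimal sketch in Python of merging by size, using a plain dictionary of parent pointers and a size table; the names (parent, size, find_root, merge_by_size) are illustrative and are not the book’s listings:

def find_root(parent, x):
    # Walk up the parent pointers until we reach a root, that is,
    # an element that is its own parent.
    while parent[x] != x:
        x = parent[x]
    return x

def merge_by_size(parent, size, x, y):
    # Link the root of the smaller tree under the root of the larger one.
    x_root, y_root = find_root(parent, x), find_root(parent, y)
    if x_root == y_root:
        return False                     # already in the same partition
    if size[x_root] < size[y_root]:      # make sure x_root is the larger tree
        x_root, y_root = y_root, x_root
    parent[y_root] = x_root              # the smaller tree becomes the child
    size[x_root] += size[y_root]         # only the new root's size is updated
    return True

Starting from parent = {e: e for e in elements} and size = {e: 1 for e in elements}, repeated calls to merge_by_size keep every tree’s height logarithmic in its size.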

As a logarithm grows really slowly (for instance, log2(1,000) ≈ 10 and log2(1,000,000) ≈ 20), this is, in practice, already a good result, sufficient for most applications.

However, if you are writing some really critical core code, such as a kernel or firmware code, you might want to do even better.

Why? Well, because you can. And sometimes also because you need to. If you shave 0.001 ms off an operation you will repeat a billion times, you’ve already saved more than 16 minutes of computation.

NOTE Most of the time, in our job as developers, performance isn’t the only metric to consider for this kind of improvement. First, it depends on whether you are saving those 16 minutes over a computation taking an hour or a day (needless to say, in the latter situation the gain would be irrelevant). But it also depends on the price you pay for the savings. If it makes your code far more fragile, more complicated, and harder to maintain, or just requires weeks of development time, you will have to weigh the pros and cons before going down this path. Luckily for disjoint sets, this is not the case: path compression is easy to implement, and it gives a big gain.

Let’s see how we could have this further improvement before delving into code.

1. Path compression

As hinted in the previous section, we can do even better than just having balanced trees and logarithmic-time methods.

To improve our results further, we can use a heuristic called path compression. The idea, shown in figure 5.11, is even simpler: for each node in the trees, instead of storing a link to its parent, we can store one to the root of the tree. After all, we don’t really need to keep track of the history of the merges we performed; we just need to know at the current time what the root is for an element’s partition—and find that out as quickly as we can.

Now, if you were to update all the root pointers as part of merge, it wouldn’t be a logarithmic method anymore; we would need linear time to update each node in the tree.

But let’s see what happens if we don’t immediately update the root pointers in the nodes of the tree that was set as the child. Simply put, the next time we run findPartition on one of the elements in that tree (call it x), we need to walk the tree from x up to its old root xR, and then from xR to the new root R.

Keep in mind that the pointers for the elements in the old tree could have been in sync before the merge (and then we would just need two hops to get to the new root; see figure 5.12), or they might have never been updated.

Because we will have to walk up the tree anyway, we can then retrace our steps from the top, R, down to x and update the root pointers for all those elements. This won’t influence our asymptotic performance for findPartition, since by retracing the same path we just double the number of steps (and constant factors are irrelevant in asymptotic analysis; see appendix B).

But as a consequence of these extra steps, the next time we call findPartition on any of the elements on the path from x to root(x), we know those pointers will be up to date, and we will need just a single step to find their root.
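Here is one way to write this two-pass strategy as a hedged sketch: root_of maps each element to its (possibly stale) root pointer, and the names are illustrative; the book’s own implementation, shown later in listing 5.9, uses recursion instead:

def find_partition(root_of, x):
    # First pass: follow the (possibly stale) root pointers up to the real root R.
    root = x
    while root_of[root] != root:
        root = root_of[root]
    # Second pass: retrace the same path and point every element on it
    # directly at R, so the next lookup for any of them takes a single step.
    while root_of[x] != root:
        root_of[x], x = root, root_of[x]
    return root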

At this point, we would like to understand how many times we will need to update the root pointers, on average, for a single operation or, in amortized analysis, over a certain number k of operations. This is where the analysis of the algorithm gets a bit complicated.

We won’t go into the details here. Just know that it has been proven that the amortized running time for m calls to findPartition and merge on a set of n elements requires O(m * Ack(n)) array accesses.

Here Ack(n) is an approximation of the inverse Ackermann function, a function growing so slowly that it can be considered a constant (its value will be lower than 5 for any integer that can be stored on a computer).

So, we managed to obtain an amortized constant bound for all the operations on this data structure! If you are not impressed by this result . . . you should be!

It is not yet known if this is the lowest bound for the Union-Find data structure. It has been proven,[1] though, that Ω(m * InvAck(m, n)) is a strict lower bound, where InvAck(m, n) is the true inverse Ackermann function.

I know, this is a lot to take in. But do not despair; it turns out that we only need a few small changes to implement the path compression heuristic.

2. Implementing balancing and path compression

We will now discuss the final implementation of our disjoint set structure, including both the “tree balancing by rank” and “path compression” heuristics.

For each element, we’ll have to store some information about its subtree. Therefore, we’ll use a helper (private) class Info to gather all the info together, as shown in listing 5.8.

This Info class models (the info associated with) a node of the partitions’ tree. It is, for all intents and purposes, just a container for two values: the root of the tree and the rank (that is, the size) of the tree rooted at the current element.

In the root property, we won’t store references to other nodes. Instead, we will directly store the (index of the) element itself, which we then use as a key into a HashMap, exactly as we have shown in the previous sections.

If we were actually modeling a tree data structure, this design would result in imperfect encapsulation. But we are just using the Info class as a tuple to gather all the properties associated with an element.
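A minimal Python sketch of what such a helper class could look like follows; the book’s actual listing 5.8 may differ in details such as validation:

class Info:
    # Gathers the per-element bookkeeping described above: the (index of the)
    # root of the element's tree and the rank (size) of the subtree rooted at
    # the element. Initially each element is the root of its own singleton tree.
    def __init__(self, elem):
        self.root = elem   # the element itself (a map key), not a node reference
        self.rank = 1      # a singleton tree has size 1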

Most implementations of disjoint sets would use two arrays for this. Since our implementation does not restrict keys to integers and we are using hash maps, we could define two Maps for the elements’ roots and ranks. In doing so, however, we would store each element three times: twice as a key (once in each map) and once as the root of some tree (this last entry could, of course, store some keys several times and others not at all).

By using this extra wrapper and a single “info” map, we make sure to store elements only once as keys.

While objects would be stored by reference, with minimal overhead, immutable values, and especially strings, would be stored by value. Therefore, even avoiding storing each element one extra time can lead to considerable memory savings.

We could, in theory, do even better by storing each element in an object wrapper and using those wrappers as keys. This way, we would store each key only once, and use the wrappers’ references everywhere, both as keys for our map(s) and as values.

Is the overhead and increased complexity of the wrapper solution worth it? This depends on the assumptions you can make on the type and size of the keys. In most cases, it is probably not, so be sure to properly profile your application and analyze your input before embarking on such optimizations.
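Purely for illustration, the wrapper-based variant mentioned above might look like the following sketch; the Node class is hypothetical and not part of the book’s implementation:

class Node:
    # Hypothetical wrapper: the raw element is stored here exactly once, and
    # every other reference (root pointers, map keys, values) is to the Node
    # object itself rather than to a copy of the element.
    __slots__ = ("elem", "root", "rank")

    def __init__(self, elem):
        self.elem = elem
        self.root = self   # points to another Node, never to a raw key
        self.rank = 1

Clients would then hold and pass around Node references instead of raw elements, which is where the extra complexity mentioned above comes from.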

To go back to our implementation: once again, changes are minimal. In both constructor and add, we just need to update the very last line:

parentsMap[elem] = new Info(elem)

We use the constructor for Info and create a new instance associated with each element.
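As a sketch, assuming a Python implementation that uses the Info class shown earlier and a constructor that receives an optional collection of initial elements, the updated constructor and add could look like this:

class DisjointSet:
    # A partial sketch: only the constructor and add are shown here; the
    # findPartition and merge methods are sketched below.
    def __init__(self, initial_set=None):
        self.parentsMap = {}
        for elem in (initial_set or []):
            self.parentsMap[elem] = Info(elem)   # the updated last line

    def add(self, elem):
        if elem in self.parentsMap:
            return False                         # already in some partition
        self.parentsMap[elem] = Info(elem)       # the updated last line
        return True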

Things definitely get more interesting when we move to findPartition, implemented in listing 5.9.

As described at the beginning of the section, when using the path compression heuristic, we don’t update the root of all elements on merge, but we do update it on findPartition. So, the main difference from the older version is that we save the result of the recursive call to findPartition at line #6, and we use it to update the current element’s root. Everything else remains exactly the same.
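A sketch of this path-compressing findPartition, written as a method of the hypothetical DisjointSet class above, might look as follows; the book’s listing 5.9 may differ in details such as validation:

def findPartition(self, elem):
    # A method of the DisjointSet class sketched above.
    info = self.parentsMap[elem]       # raises KeyError for unknown elements
    if info.root == elem:
        return elem                     # elem is the root of its own tree
    # Save the result of the recursive call and use it to update this
    # element's root pointer: this is the path compression step.
    info.root = self.findPartition(info.root)
    return info.root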

It goes without saying that the largest portion of changes will be implemented in the merge method, as shown in listing 5.10.

We still retrieve the elements at the roots of the trees as before, and still check that they are not the same.

But after that, we actually have to retrieve the info for both roots, and we check which tree is larger: the smaller one will end up as the child, and we will reassign its root. Moreover, we need to update the rank for the larger tree’s root; its subtree will now also contain all the elements in its new child.
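Putting it together, here is a sketch of merge with both heuristics, again as a method of the hypothetical DisjointSet class and not necessarily identical to the book’s listing 5.10:

def merge(self, elem1, elem2):
    # A method of the DisjointSet class sketched above.
    root1 = self.findPartition(elem1)   # also compresses elem1's path
    root2 = self.findPartition(elem2)   # also compresses elem2's path
    if root1 == root2:
        return False                    # already in the same partition
    info1, info2 = self.parentsMap[root1], self.parentsMap[root2]
    if info1.rank < info2.rank:         # make sure root1 is the larger tree
        root1, root2 = root2, root1
        info1, info2 = info2, info1
    info2.root = root1                  # the smaller tree becomes the child
    info1.rank += info2.rank            # root1's tree absorbs the child's elements
    return True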

If you’d like to look at some code for the heuristics implementation, we have an example on GitHub.

This is all we need to change in order to achieve a tremendous boost in performance. The simplicity of the code shows you how clever this solution is, and in the next sections we will also see why it is so important to get it right.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
