Improving priority queues – Performance analysis: Finding the best branching factor

We’ve discussed the theory. Now let’s try to apply it to a real case and describe how to profile the implementation of a data structure and an application.

We have seen that priority queues are a key component in the Huffman compression pipeline. If we measure the performance of our heap’s methods as the number of swaps performed, we have shown that each method call performs at most h swaps, where h is the height of the heap. In section 2.5 we also showed that, because the heap is a complete, balanced tree, the height of a d-ary heap containing n elements is log_D(n) (up to rounding).

Therefore, for both methods insert and top, it would seem that the larger the branching factor, the smaller the height, and in turn the better the heap’s performance should be. For instance, with roughly one million elements, a binary heap is about 20 levels tall, while a 10-ary heap has only 6 levels.

But just counting swaps doesn’t tell us the whole story. In section 2.9 we delve into the performance analysis of these two methods and also take into consideration the number of array accesses, or equivalently the number of comparisons on heap elements, needed for each method. While insert accesses only one element per level of the heap, method top traverses the tree from root to leaves, and at each level it needs to go through the whole list of children of a node. Therefore, it needs approximately up to D*log_D(n) accesses on a heap with branching factor D and containing n elements.

Tables 2.6 and 2.7 summarize the performance analysis for the three main methods in the heap’s API.

For method top, a larger branching factor doesn’t always improve performance, because while log_D(n) becomes smaller, D becomes larger. Taking this to the extreme, if we choose D > n-1, then the heap becomes a root with a list of n-1 children, and so while insert will require just 1 comparison and 1 swap, the top method will need n comparisons and 1 swap (in practice, as bad as keeping an unsorted list of elements).

There is no easy way to find a value for D that minimizes the function f(D) = D*log_D(n) for all values of n, and besides, this formula only gives us an estimate of the maximum number of accesses/swaps performed. The exact number of comparisons and swaps actually performed depends on the sequence of operations and on the order in which elements are added.

Then the question arises: How do we choose the best value for the branching factor?

The best we can do here is profile our applications to choose the best value for this parameter. In theory, applications with more calls to insert than to top will perform better with larger branching factors, while when the ratio of calls between the two methods approaches 1.0, a more balanced choice will be best.

1. Please welcome profiling

And so we are stuck with profiling. If you are asking yourself “What’s profiling?” or “Where do I start?”, here are a few tips:

  • Profiling means measuring the running time, and possibly the memory consumption, of different parts of your code.
  • It can be done at a high level (measuring the calls to high-level functions) or at a lower level, for each instruction, and although you can set it up manually (measuring the time before and after a call), there are great tools to aid you, usually guaranteeing an error-free measurement.
  • Profiling can’t give you general-purpose answers: it can only measure the performance of your code on the input you provide.
  • In turn, this means that the result of profiling is only as good as the input you use: if you only use a very specific input, you might tune your code for an edge case, and it could perform poorly on different inputs. Another key factor is data volume: to be statistically significant, profiling results should be collected over many runs on (pseudo)random inputs.
  • It also means that results are not generalizable to different programming languages, or even to different implementations in the same language.
  • Profiling requires time. The more in-depth you go with tuning, the more time it requires. Remember Donald Knuth’s advice: “premature optimization is the root of all evil.” Meaning you should only get into profiling to optimize critical code paths. Spending two days to shave 5 milliseconds off an application that takes 1 minute is frankly a waste of time (and if tuning forces you to make your code more convoluted, it will likely also make it worse).

If, in spite of all these disclaimers, you realize that your application actually needs some tuning, then brace yourself and choose the best profiling tool available for your framework.

To perform profiling, we will obviously have to abandon pseudocode and choose an actual implementation; in our example, we will profile a Python implementation of Huffman encoding and of the d-ary heap. You can check out the implementation on our repo[3] on GitHub.

Code and tests are written in Python 3, specifically using version 3.7.4, the latest stable version at the time of writing. We are going to use a few libraries and tools to make sense of the profiling stats we collect:

  • Pandas
  • Matplotlib
  • Jupyter Notebook

To make your life easier, if you’d like to try out the code, I suggest you install the Anaconda distribution, which already includes the latest Python distribution and all the packages listed.

To do the actual profiling, we use the cProfile package, which is already included in the standard Python distribution.

We won’t explain in detail how to use cProfile (lots of free material online covers this, starting from the Python docs linked previously), but to sum it up, cProfile allows running a method or function and records the per-call, total, and cumulative time taken by every method involved.

Using pstats.Stats, we can retrieve and print (or process) those stats; the output of profiling looks something like what’s shown in figure 2.18.
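As a minimal sketch of how this can look in practice (here the huffman module, the exact signature of create_encoding, and the input text are placeholders for illustration, not the exact code from the repo):

```python
import cProfile
import pstats

import huffman  # the module from the repo; name and signature assumed for illustration

text = "an input string to compress; in the real tests this is a large text file"

profiler = cProfile.Profile()
profiler.enable()
huffman.create_encoding(text, branching_factor=4)  # the call we want to measure
profiler.disable()

# Retrieve the collected stats and print the 10 entries with the largest cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(10)
```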

Now, to reduce the noise, we are only interested in a few methods, specifically the functions that use a heap: in particular _frequency_table_to_heap, which takes the dictionary with the frequency (or number of occurrences) of each character in the input text and creates a heap with one entry per character, and _heap_to_tree, which in turn takes the heap created by the former function and uses it to create the Huffman encoding tree.

We also want to track down the calls to the heap methods used: _heapify, top, and insert. Instead of just printing those stats, we can read them as a dictionary, and filter the entries corresponding to those five functions.

To have meaningful, reliable results, we also need to profile several calls to the method huffman.create_encoding, so processing the stats and saving the result to a CSV file seems the best choice.
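A sketch of that filtering-and-saving step could look like the following; the Stats.stats dictionary maps (file, line, function name) tuples to (primitive call count, total call count, total time, cumulative time, callers) tuples, while the CSV layout below is my own choice rather than the one used in the repo:

```python
import csv
import pstats

TRACKED = {'_frequency_table_to_heap', '_heap_to_tree', '_heapify', 'top', 'insert'}

def append_stats_to_csv(profiler, branching_factor, csv_path):
    # Read the raw stats as a dictionary and keep only the five functions we care about.
    entries = pstats.Stats(profiler).stats
    with open(csv_path, 'a', newline='') as f:
        writer = csv.writer(f)
        for (filename, lineno, func_name), (cc, nc, tt, ct, callers) in entries.items():
            if func_name in TRACKED:
                # One row per run: branching factor, function, #calls, total and cumulative time.
                writer.writerow([branching_factor, func_name, nc, tt, ct])
```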

To see the profiling in action, check out our example[4] on GitHub. The example profiles several calls to the method creating a Huffman encoding over an ensemble of large text files[5] and a bitmap image. The bitmap needs to be duly pre-processed in order to make it processable as text. The details of this pre-processing are not particularly interesting, but for the curious reader: we encode the image bytes in base64, so that the result only contains valid text characters.
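A minimal sketch of that pre-processing, just reading the raw bytes and re-encoding them in base64:

```python
import base64

def bitmap_to_text(path):
    # Read the raw image bytes and encode them in base64, so that the result
    # only contains characters the (text-based) Huffman encoder can handle.
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('ascii')
```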

2. Interpreting results

Now that we’ve stored the results of profiling in a CSV file, we just need to interpret them to understand what the best choice of branching factor is for our Huffman encoding app. At this point it should look like a piece of cake, right?

We can do this in many ways, but my personal favorite is displaying some plots in a Jupyter Notebook: take a look at the example notebook.[6] Isn’t it great that GitHub renders the Notebook directly, without you having to run it locally?

Before delving into the results, though, we need to take a step back: because we will use a boxplot[7] to display our data, let’s make sure we know how to interpret these plots first. If you feel you could use a hint, figure 2.19 comes to the rescue!

To make sense of the data, our example notebook uses the Pandas library, with which we read the CSV file with the stats and transform it into a DataFrame, Pandas’ internal tabular representation. You can think of it as a SQL table on steroids. In fact, using this DataFrame we can partition data by test case (image versus text), then group it by method, and finally process each method separately and display the results, for each method, grouped by branching factor. Figure 2.20 shows these results for encoding a large text, comparing the two main functions in Huffman encoding that use a heap.
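The real code lives in the notebook linked above; a simplified sketch of the same idea, assuming the stats CSV has the layout produced by the earlier sketch, looks roughly like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('profiling_stats.csv',
                 names=['branching_factor', 'method', 'calls', 'total_time', 'cumulative_time'])

# Per-call running time for one of the high-level functions, grouped by branching factor.
df['time_per_call'] = df['total_time'] / df['calls']
method_df = df[df['method'] == '_frequency_table_to_heap']
method_df.boxplot(column='time_per_call', by='branching_factor', showfliers=False)
plt.title('_frequency_table_to_heap')
plt.xlabel('branching factor')
plt.ylabel('time per call (s)')
plt.show()
```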

Figure 2.19 Explaining how to read an example boxplot. Boxplots show the distribution of samples using a nice and effective chart. The main idea is that the most relevant samples are those whose values lie between the first and third quartile. That is, we find three values, Q1, Q2 (aka the median), and Q3, such that 25% of the samples are smaller than Q1, 50% of the samples are smaller than Q2 (which is the very definition of median, by the way), and 75% of the samples are smaller than Q3. The box in the boxplot is meant to clearly visualize the values of Q1 and Q3. Whiskers, instead, show how far from the median the data extends. In addition, we could also display outliers, that is, samples that fall outside the whiskers’ span. Sometimes, though, outliers end up being more confusing than useful, so in this example we will not use them.

If we look at those results, we can confirm a few points we mentioned in section 2.9:

  • 2 is never the best choice for a branching factor.
  • A branching factor of 4 or 5 seems a good compromise.
  • There is a consistent difference (up to 50% for _frequency_table_to_heap) between a binary heap and the d-ary heap with the best branching factor.

There is, however, also something that might look surprising: _frequency_table_to_heap seems to improve with larger branching factors, while _heap_to_tree has a minimum for branching factors around 9 and 10.

As explained in section 2.6, which shows the pseudocode implementation of the heap methods (and as you can see from the code on GitHub), the former method only calls the _push_down helper method, while the latter uses both top (which in turn calls _push_down) and insert (relying on _bubble_up instead), so we would expect to see the opposite result. It is true, though, that _heap_to_tree makes roughly two calls to top for each call to insert.

Let’s now delve into the internals of these high-level methods and look at the running time per call of the heap’s internal method _heapify (figure 2.22), and of the API’s methods top and insert and their helper methods _push_down and _bubble_up (figures 2.21 and 2.23).

Before digging into the most interesting part, let’s quickly check out the insert method. Figure 2.21 shows there are no surprises here; the method tends to improve with larger branching factors, as does _bubble_up, its main helper.
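To see why one comparison per level is enough for insert, here is a minimal sketch of a bubble-up helper, assuming (purely for illustration; the repo’s implementation may differ) that the heap is stored in a flat list of (priority, element) pairs with branching factor d:

```python
def _bubble_up(elements, d, index):
    # Move the element at `index` up towards the root while its priority is
    # higher than its parent's: one comparison, and at most one swap, per level,
    # so at most log_d(n) of each overall.
    while index > 0:
        parent = (index - 1) // d
        if elements[index][0] <= elements[parent][0]:
            break  # heap property holds, nothing left to do
        elements[index], elements[parent] = elements[parent], elements[index]
        index = parent
```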

The _heapify method, as expected, shows a trend similar to _frequency_table_to_heap, because all this method does is create a heap from the frequency table. Still, it is a bit surprising that _heapify’s running time doesn’t degrade with larger branching factors, isn’t it?

Now to the juicy part. Let’s take a look at top in figure 2.23. If we look at the running time per call, the median and the distribution clearly show a local minimum (around D==9), as we would have expected, considering that the method’s running time is O(D*log_D(n)); this confirms what we previously discussed and summarized in table 2.7. Not surprisingly, the _push_down helper method has an identical distribution.

If we look at listings 2.7 and 2.4 and sections 2.6.2 and 2.6.4, it’s clear that pushDown is the heart of the top method, and in turn the largest chunk of work for pushDown is done by the method that, at each level of the heap, retrieves the highest-priority child of the current node. We called it highestPriorityChild.

And if we then look at the running time per call for _highest_priority_child (figure 2.24 (A)), we have confirmation that our profiling does make sense, because the running time per call increases with the branching factor: the larger D is, the longer the list of children of each node, which this method needs to traverse entirely in order to find which tree branch should be traversed next.
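A minimal sketch of this helper (same illustrative layout as the sketch above: a flat list of (priority, element) pairs, where the children of node i sit at indices d*i + 1 through d*i + d) makes the linear scan over up to D children explicit:

```python
def _highest_priority_child(elements, d, index):
    # Linearly scan the (up to d) children of `index` and return the index of the
    # child with the highest priority, or None if `index` is a leaf.
    first = d * index + 1
    if first >= len(elements):
        return None
    last = min(first + d, len(elements))
    return max(range(first, last), key=lambda i: elements[i][0])
```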

You might then wonder why _push_down doesn’t show the same trend. Remember that while the running time for _highest_priority_child is O(D), in particular with up to D comparisons, _push_down performs (at most) log_D(n) swaps and D*log_D(n) comparisons, because it calls _highest_priority_child at most log_D(n) times.

The larger the branching factor D, the fewer calls are made to _highest_priority_child. This becomes apparent if, instead of plotting the running time per call for _highest_priority_child, we use the total cumulative time (the sum over all calls to the method), as shown in figure 2.24 (B). There we can see again how this composite function, f(D) = D*log_D(n), has a minimum at D==9.

3. The mystery with heapify

Summing up: _heapify keeps improving even at larger branching factors, although we can say it practically plateaus after D==13; this behavior, however, is not caused by the methods top and _push_down, which do behave as expected.

There could be a few explanations for the way the _heapify running time evolves:

  1. It’s possible that, checking larger branching factors, we discover that there is a minimum (we just haven’t found it yet).
  2. The performance of this method heavily depends on the order of insertions: with sorted data it performs way better than with random data.
  3. Implementation-specific details make the contribution of calls to _push_down less relevant over the total running time.

But . . . are we sure that this actually contradicts the theory? You should know by now that I like rhetorical questions; this means it is time to get into some math.

And indeed, if we take a closer look at section 2.6.7, we can find out that the number of swaps performed by _heapify is limited by:
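A simplified version of that bound (not the book’s exact expression, and ignoring floors and ceilings) follows from the standard analysis of bottom-up heap construction: a D-ary heap has roughly n/D^h nodes at height h, and pushing each of them down costs at most h swaps, so the total number of swaps is at most

$$\sum_{h=1}^{\log_D n} h \cdot \frac{n}{D^h} \;\le\; n \cdot \frac{D}{(D-1)^2}$$

which, for a fixed n, decreases as D grows.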

As you can imagine, analyzing this function is not straightforward. But, and here is the fun part, you can certainly plot it using your (possibly newly acquired) Jupyter Notebook skills. And it turns out that, plotting this function of D for a few values of n, we get the charts in figure 2.25, showing that, although the single calls to _push_down will be slower with a higher branching factor, the total number of swaps is expected to decrease.
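For instance, a quick sketch like the following (using the simplified bound above; the specific values of n are arbitrary) reproduces the shape of those curves:

```python
import numpy as np
import matplotlib.pyplot as plt

def max_swaps(n, d):
    # Simplified upper bound on heapify swaps for an n-element d-ary heap:
    # sum over heights h of h * n / d**h.
    height = int(np.ceil(np.log(n) / np.log(d)))
    return sum(h * n / d**h for h in range(1, height + 1))

ds = range(2, 21)
for n in (10_000, 100_000, 1_000_000):
    plt.plot(list(ds), [max_swaps(n, d) for d in ds], label=f'n = {n}')

plt.xlabel('branching factor D')
plt.ylabel('upper bound on number of swaps')
plt.legend()
plt.show()
```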

So, mystery solved; _heapify does behave as expected. And good news, too: it gets faster as we increase the branching factor.

4. Choosing the best branching factor

One mystery solved, but a big question still stands: What is the best branching factor to speed up our Huffman encoding method? We digressed, delving into the details of the analysis of heaps, and somehow left the most important question behind.

To answer this question, there is only one way: looking at figure 2.26, which shows the running time for the main method of our Huffman encoding library.

It shows three interesting facts:

  • The best choice seems to be D==9.
  • Choosing any value larger than 7 will likely be as good.
  • While the maximum gain for _frequency_table_to_heap is around 50%, and for _heapify even up to 80%, here we merely get about 5%.

Notice how the chart for create_encoding looks more similar to the one for _heap_to_tree than to the one for _frequency_table_to_heap. Also, considering the third point, the explanation is that the operations on the heap only contribute a fraction of the running time, while the most demanding methods need to be looked for elsewhere. (Hint: for create_frequency_table, the running time depends on the length of the input file, while for the other methods it only depends on the size of the alphabet.)

You can check out the full results, including the other examples, in the notebook on GitHub. Keep in mind that this analysis uses a limited input, and as such is likely to be biased. This is just a starting point, and I highly encourage you to try running a more thorough profiling session, using more runs on different inputs.

You can also pull the full profiling and visualization code and delve even deeper into the analysis.

To wrap up, I’d like to highlight a few takeaways:

  • D-ary heaps are faster than binary ones.
  • If you have parametric code, profiling is the way to tune your parameter. The best choice depends on your implementation as well as on input.
  • Be sure to run your profiling on a representative sample of your domain. If you use a limited subset, you will likely optimize for an edge case, and your results will be biased.

As a general rule, make sure to optimize the right thing: only profile critical code, perform high-level profiling first to spot the section of code whose optimization would give you the largest improvement, and verify that this improvement is worth the time you’ll spend on it.

Source: Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications, 2021.
