Improving priority queues: Use case

1. Use case: Find the k largest elements

In this section we are going to describe how we can use a priority queue to keep track of the k largest elements of a set.

If we have the full set of n elements in advance, we have a few alternatives that don’t need any auxiliary data structure:

  • We could sort the input and take the last k This naive approach requires O(n*log(n)) comparisons and swaps and, depending on the algo­rithm, might require additional memory.
  • We could find the largest element from the set and move it to the end of the array, then look at the remaining n-1 elements and find the second to last and move it to position n-2, and so Basically, this algorithm runs the inner cycle of Selection Sort algorithm k times, requiring O(k) swaps and O(n*k) com­parisons. No additional memory would be needed.

In this section we will see that by using a heap, we can achieve our goal using O(n+k*log(k)) comparisons and swaps, and O(k) extra memory. This is a game­changing improvement if k is much smaller than n. In a typical situation, n could be on the order of millions or billions of elements, and k between a hundred and a few thousand.

Moreover, by using an auxiliary heap, the algorithm can naturally be adapted to work on dynamic streams of data and also to allow consuming elements from the heap.

1.1. The right data structure …

When your problem involves finding a subset of largest/smallest elements, priority queues seem like a natural solution.

In programming, choosing the right data structure can make a difference.15 That’s not always enough, because you also need to use it correctly.

Suppose, for instance, that we have a static set of elements available from the beginning. We could use a max-heap, insert all n elements, and then extract the larg­est k of them.

We would need O(n) extra space for the heap, and then use heapify to create it from the full set in linear time, O(n). Then we would call top k times, with a cost of O(log(n)) for each call. The total cost for this solution would be O (n+k*log(n)) com­parisons and swaps.

That’s already better than the naive solutions, but, if you think about it, if16 n>>k, we are creating a huge heap just to extract a few elements. That certainly sounds wasteful.

1.2. … and the right use

So our goal should be having a small heap with k elements. Using a max-heap doesn’t really work anymore. Let’s see why with an example.

Suppose we want to find the largest three of the following numbers: 2, 4, 1, 3, 7, 6. We add the first three and we have the following max-heap: [4, 2, 1]. Now we proceed to the next number in the list, and it’s 3. It’s larger than two out of three elements cur­rently in the heap, but we have no way of knowing this, because we can only peek at the top of the heap. Then we can insert 3 into the heap, and we obtain [4, 3, 1, 2]. Now, if we want to keep the size of the heap at k elements, we need to remove one, which is the minimum. How do we know where it is inside the heap? We only know where the maximum is (at the top of the max-heap), and so the search for the min could take up to linear time (even noting that the minimum will be in one of the leaves, there are unfortunately a linear number of them, n/D).

You can verify that even by inserting the elements in a different order, we often find a similar situation.

The catch is that when we want the k largest elements, at each step we are inter­ested in understanding if the next number we evaluate is larger than the smallest of the k elements we already have. Hence, rather than a max-heap, we can use a min- heap bound to k elements where we store the largest elements found so far.

For each new element, we compare it to the top of the heap, and if the new one is smaller, we are sure it’s not one of the k largest elements. If our new element is larger than the heap’s top (that is, the smallest of our k elements), then we extract the top from the heap and then add our newly arrived. This way, updating the heap at each iteration costs us only constant time, instead of the linear time bound if we used a max-heap.

1.3. Coding it up

That’s neat, right? And simple to implement. Since code is worth a thousand words, let’s see our heap in action in listing 2.11.

2. More use cases

The heap is one of the most universally used data structures. Together with stack and queue, it is the basis of almost every algorithm that needs to process the input in a spe­cific order.

Replacing a binary heap with a d-ary heap can improve virtually any piece of code that uses a priority queue. Before delving into a few algorithms that can benefit from the use of a heap, make sure you are familiar with graphs, because most of these algo­rithms will concern this data structure. To that end, chapter 14 provides a quick intro­duction to graphs.

Let’s now discuss some algorithms that can benefit from the use of a heap.

2.1. Minimum distance in graphs: Dijkstra

Priority queues are crucial to implementing Dijkstra and A* algorithms, described in detail in chapter 14, sections 14.4 and 14.5. Figure 2.13 shows a minimal example of a graph, illustrating the concept of shortest path between two vertices. As we will discuss in chapter 14, the running time of these fundamental algorithms on graphs (which compute the minimum distance to a target) heavily depends on the implementation chosen for the priority queue, and upgrading from a binary to a d-ary heap can provide a consistent speedup.

2.2. More graphs: Prim’s algorithm

Prim’s algorithm computes the minimum spanning tree (MST) of an undirected, con­nected graph G.

Suppose G has n vertices. The minimum spanning tree of G is

  1. A tree (a connected, undirected, acyclic graph)
  2. That is a subgraph of G with n vertices and
  3. Whose sum of edges’ weights is the least possible among all of the subgraphs of G that are also trees and span over all n vertices

Considering the graph in the example shown in section 2.8.1, its minimum spanning tree would be the one in figure 2.14.

Prim’s algorithm works exactly as Dijkstra’s, except

  • Without keeping track of the distance from the source.
  • Storing the edge that connected the front of the visited vertices to the next closest vertex.
  • The vertex used as “source” for Prim’s algorithm is going to be the root of the MST.

It should be no surprise that its running time is similar to Dijkstra’s:

  • O(V2) using arrays (sorted or unsorted) for the priority queue
  • O(V*log(V) + E*log(V)) using binary or d-ary heap
  • O(V*log(V) + E) using Fibonacci heap

2.3. Data compression: Huffman codes

Huffman’s algorithm is probably the most famous data compression algorithm, and you have likely already heard of it if you took an “introduction to CS” course. It is a simple, brilliant, greedy algorithm that, despite not being the state of the art for com­pression anymore, was a major breakthrough in the ‘50s.

A Huffman code is a tree, built bottom-up, starting with the list of different charac­ters appearing in a text and their frequency. The algorithm iteratively

  1. Selects and removes the two elements in the list with the smallest frequency
  2. Then creates a new node by combining them (summing the two frequencies)
  3. And finally adds back the new node to the list

While the tree itself is not a heap, a key step of the algorithm is based on efficiently retrieving the smallest elements in the list, as well as efficiently adding new elements to the list. You probably have guessed by now that, once again, that’s where heaps come to the rescue.

Let’s take a look at the algorithm itself in listing 2.12.

We assume the input for the algorithm is a text, stored in a string (of course, the actual text might be stored in a file or stream, but we can always have a way to convert it to a string), and the output is a map from characters to binary sequences.

The first sub-task that we need to perform is to transform the text: we want to com­pute some statistics on it to identify the most used and least used characters in it. To that end, we compute the frequency of characters in the text.

The details of the ComputeFrequencies method at line #2 are both out of scope and (at least in its basic version) simple enough, and there is no need to delve into that helper method here.

Once we have computed the frequency map, we create a new priority queue and then at lines #4 and #5 we iterate over the frequency map, creating a new TreeNode for each character and then adding it to the priority queue, as in listing 2.12. Obvi­ously, considering the subject of this chapter, for the queue we use a heap, and in par­ticular a min-heap, where the element at the top is the one with the smallest value for the priority field. And in this case the priority field is (not surprisingly) the frequency field of the TreeNode.

Each TreeNode, in fact, contains two fields (besides the pointers to its children): a set of characters and the frequency of those characters in the text, computed as the sum of the frequencies of individual characters.

If you look at figure 2.15, you can see that the root of the final tree is the set of all characters in our example text, and hence the total frequency is 1.

This set is split into two groups, each of which is assigned to one of the root’s chil­dren, and so each internal node is similarly split until we get to leaves, each of which contains just one character.

Back to our algorithm, you can see how the tree in figure 2.15 is constructed bot­tom-up, and lines #2 to #5 in listing 2.12 take care of the first step, creating the leaves of the tree and adding them to the priority queue.

Figure 2.16 The first step in the Huffman coding algorithm. As mentioned in the text, the algorithm will use two auxiliary data structures, a priority queue and a binary tree. Each tree node will have a value, a set of characters in the text, and a priority, the sum of the frequencies of those characters in the text. (A) Initially, we create one tree node for each character, associated with its frequency in the text. We also add each node into a priority queue, using the frequency as its priority (smaller frequency means higher priority hence we would use a min-heap). (B) We extract two elements from the top of the priority queue. (C) We create a new tree node to which we add the two nodes extracted at step (B) as its children. By convention, we assume that the smallest node will be added as left child and the second-smallest as right child (but any consistent convention works here). The newly created node will hold the union of the set of characters in its children as value, and the sum of the two priorities as priority. (D) Finally, we can add the new root for this subtree back to the priority queue. Note that the nodes in the heap are showed in sorted order, but for the sake of simplicity the order in which nodes are stored inside a priority queue is an implementation detail, the contract for PQ’s API only guarantees that when we dequeue two elements, those will be the ones with the smallest frequencies.

Now, from line #6 we enter the core of the algorithm: until there is only one element left in the queue, we extract the top TreeNode entries, in lines #7 and #8. As you can see in figure 2.16 (B), those two elements will be the subtrees with the lowest frequen­cies so far.

Let’s call these subtrees L and R (the reason for these names will be apparent soon).

Figure 2.16 (C) shows the actions performed in lines #9 to #11 of our pseudocode: a new TreeNode is created (let’s call it P) by merging the two entries’ character sets and setting its frequency as the sum of the old subtrees’ frequencies. Then the new node and two subtrees are combined in a new subtree, where the new node P is the root and the subtrees L and R are its children.

Finally, at line #12 we add this new subtree back into the queue. As it’s shown in fig­ure 2.16 (D), it can sometimes be placed at the top of the queue, but that’s not always the case; the priority queue will take care of this detail for us (notice that here the pri­ority queue is used as a black box, as we discussed in section 2.4).

Listing 2.13 The Huffman coding algorithm (building a table from the tree)

function buildTable(node, sequence, charactersToSequenceMap)

if node.characters.size == 1 then

charactersToSequenceMap[node.characters[0]] ← sequence

else

if node.left <> null then

buildTable(node.left, 0 + sequence, charactersToSequenceMap)

if node.right <> null then

buildTable(node.right, 1 + sequence, charactersToSequenceMap)

return charactersToSequenceMap

These steps are repeated until there is only one element left in the queue (figure 2.17 shows a few more steps), and that last element will be the TreeNode that is the root of the final tree.

We can then use it in line #13 to create a compression table, which will be the final output of the huffman method. In turn, the compression table can be used to per­form the compression of the text by translating each one of its characters into a sequence of bits.

While we won’t show this last step, we provide listing 2.13 with the steps needed to create a compression table from the tree in figure 2.15. And even if this goes beyond the scope of this chapter (because the method doesn’t use a priority queue), providing a brief explanation should help those readers interested in writing an implementation of Huffman coding.

We wrote the buildTable method using recursive form. As explained in appendix E, this allows us to provide cleaner and more easily understandable code, but in some languages concrete implementations can be more performant when implemented using explicit iterations.

We pass three arguments to the method: a TreeNode node that is the current node in the traversal of the tree, a sequence that is the path from the root to current node (where we add a 0 for a “left turn” and a 1 for a “right turn”), and the Map that will hold the associations between characters and bit sequences.

At line #2, we check if the set of characters in the node has only one character. If it does, it means we have reached a leaf, and so the recursion can stop. The bit sequence that is associated with the character in the node is the path from root to current node, stored in the sequence variable.

Otherwise, we check whether the node has left and right children (it will have at least one, because it’s not a leaf) and traverse them. The crucial point here is how we build the sequence argument in the recursive calls: if we traverse the left child of current node, we add a 0 at the start of the sequence, while if we traverse the right child, we add a 1.

Figure 2.17 The result of the next couple of steps in the Huffman coding algorithm. (A) We dequeue and merge the top two nodes on the heap, C and D. At the end of this step, EF and CD become the two smallest nodes in the heap. (B) Now we merge those two nodes into CDEF, and we add it back to the heap. Which node between CDEF and B will be kept at the top of the priority queue is an implementation detail, and it’s irrelevant for the Huffman coding algorithm (the code will change slightly depending on which one is extracted first, but its compression ratio will remain unchanged). The next steps are easy to figure, also using figure 2.15 as a reference.

Table 2.5 shows the compression table produced starting from the tree shown in fig­ure 2.15; the last column would not be part of the actual compression table, but it’s useful to understand how the most used characters end up translated into shorter sequences (which is the key to an efficient compression).

Looking at the sequences, the most important property is that they form a prefix code: no sequence is the prefix of another sequence in the code.

This property is the key point for the decoding: iterating on the compressed text, we immediately know how to break it into characters.

For instance, in the compressed text 1001101, if we start from the first character, we can immediately see that sequence 10 matches b, then the next 0 matches A, and finally 1101 matches D, so the compressed bit sequence is translated into “BAD”.

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)

Leave a Reply

Your email address will not be published. Required fields are marked *