Improving priority queues: Concrete data structures

Let’s now move from abstract to concrete data structures. Knowing how the API of a priority queue works is good enough to use it, but often it is not enough to use it well. Especially in time-critical components or in data-intensive applications, we often need to understand the internals of the data structures and the details of their implementation to make sure we can integrate them into our solution without introducing a bottleneck.

Every abstraction needs to be implemented using a concrete data structure. For example, a stack can be implemented using a list, an array, or in theory even a heap (although this would be silly, as will be clear in a few sections). The choice of the underlying data structure used to implement a container will only influence the container’s performance. Choosing the best implementation is usually a tradeoff: some data structures speed up some operations but will make other operations slower.

1. Comparing performance

For the implementation of a priority queue, we will initially consider three naive alternatives using core data structures discussed in appendix C: an unsorted array, where we just add elements to its end; a sorted array, where we make sure the ordering is reinstated every time we add a new element; and balanced trees, of which heaps are a special case. Let’s compare, in table 2.2, the running times for basic operations implemented with these data structures.
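To make the first alternative concrete, here is an illustrative sketch (not from the book) of a priority queue backed by an unsorted Python list. The class and method names are placeholders; what matters is the asymptotic behavior, which matches the unsorted-array row of table 2.2: constant-time insert, linear-time retrieval of the top element.

```python
class UnsortedArrayPQ:
    """Naive priority queue over an unsorted array (higher number = higher priority)."""

    def __init__(self):
        self._data = []  # (priority, element) pairs, in insertion order

    def insert(self, priority, element):
        self._data.append((priority, element))  # O(1) amortized

    def top(self):
        # O(n): scan the whole array for the highest priority...
        i = max(range(len(self._data)), key=lambda j: self._data[j][0])
        # ...then swap it with the last element and pop, so removal itself is O(1)
        self._data[i], self._data[-1] = self._data[-1], self._data[i]
        return self._data.pop()[1]
```

Insert is cheap precisely because it does no work to maintain any ordering; the cost is paid, in full, on every retrieval.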

2. What’s the right concrete data structure?

From table 2.2 we can see that naive choices would lead to a linear time requirement for at least one of the core operations, while a balanced tree would always guarantee a logarithmic worst case. Although linear time is usually regarded as “feasible,” there is still a tremendous difference between logarithmic and linear: for a billion elements, the difference is between 1 billion operations and about 30. If each operation takes 1 millisecond, that means going from more than 11 days to a fraction of a second.

A few notes on the entries in table 2.2:

  • For the unsorted array, find-minimum can run in constant time if we save the minimum in an extra variable and charge the cost of maintaining its value to insert and delete.
  • If we use such a buffer to speed up find-minimum, however, we need to find the next minimum on delete. Unfortunately, nothing comes for free. Alternatively, we could have constant-time delete by giving up the buffer and swapping the deleted element with the last in the array, at the price of linear-time find-minimum.
  • For the sorted array, if we store elements in reverse order, deleting the highest-priority element might just be a matter of shrinking the size of the array, or somehow keeping track of which is the last element in the array.
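A quick back-of-the-envelope check of the logarithmic-versus-linear gap mentioned above, for n equal to 1 billion:

```python
import math

n = 10**9
log_ops = math.ceil(math.log2(n))  # operations for a logarithmic algorithm
print(log_ops)                      # about 30
print(n)                            # 1,000,000,000 for a linear one

# At 1 millisecond per operation, the linear algorithm needs:
print(n / 1000 / 3600 / 24)         # roughly 11.6 days (vs ~30 ms)
```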

Moreover, consider that most of the time containers, and priority queues in particular, are used as support structures, meaning that they are part of more complex algorithms or data structures, and each cycle of the main algorithm can call operations on the priority queue several times. For example, for a sorting algorithm, this could mean going from O(n²), which is usually unfeasible for n as large as 1 million or even less, to O(n*log(n)), which is still tractable for inputs of size 1 billion or more.[3] However, this would come at a cost, because the implementation of balanced binary trees is usually not trivial.

Next, we present a way to efficiently implement a generic priority queue.

3. Heap

A binary heap is the most widely used implementation of a priority queue. It allows elements to be inserted and retrieved in either ascending or descending order, one at a time.

While in practice an underlying array is used to store the heap’s elements, a heap can conceptually be represented as a particular kind of binary tree, abiding by three invariants:

  • Every node has at most two children.
  • The heap tree is complete and left-adjusted. Complete (see figure 2.3) means that if the heap has height H, every leaf node is either at level H or H-1. Left-adjusted means that no right subtree has a height greater than its left sibling’s; so, if a leaf is at the same height as an internal node, the leaf can’t be to the left of that node. Invariants 1 and 2 are the structural invariants for heaps.
  • Every node holds the highest priority in the subtree rooted at that node.

Figure 2.3 Two examples of complete binary trees. All levels in the tree have the maximum possible number of nodes, except (possibly) the last one. All leaves in the last level are left-adjusted; at the previous level, no right subtree has more nodes than its left sibling.

Properties (1) and (2) allow for the array representation of the heap. Supposing we need to store N elements, the tree structure is directly and compactly represented using an array of N elements, without pointers to children or parents. Figure 2.4 shows how the tree and array representation of a heap are equivalent.

In the array representation, if we are counting indices from 0, the children of the i-th node are stored in the positions (2 * i) + 1 and 2 * (i + 1), while the parent of node i is at index (i – 1) / 2 (except for the root, which has no parent). For example, looking at figure 2.4, the node at index 1 (whose priority is 3) has two children at indices 3 and 4, and its parent is at index 0; if we consider the node at position 5, its parent is the element at index 2 and its children (not shown in the picture) would be at indices 11 and 12.
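The index formulas above are a direct transcription into code; the function names are ours, and the assertions check them against the example from figure 2.4:

```python
# 0-based index arithmetic for a binary heap stored in an array.
def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * (i + 1)      # i.e. 2 * i + 2

def parent(i):
    return (i - 1) // 2     # integer division; only meaningful for i > 0

# Checking against the example in figure 2.4:
assert (left_child(1), right_child(1)) == (3, 4)
assert parent(1) == 0
assert parent(5) == 2
assert (left_child(5), right_child(5)) == (11, 12)
```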

It might seem counterintuitive that we use an array to represent a tree. After all, trees were invented to overcome array limitations. This is generally true, and trees have a number of advantages: they are more flexible and, if balanced, allow for better performance, with worst-case logarithmic search, insertion, and deletion.

Figure 2.4 A binary heap. Only priorities are shown for the nodes (elements are irrelevant here). The numbers inside the small squares show the indices of the heap elements in the array. Nodes are matched to array elements top to bottom, left to right. Top: the tree representation for the heap. Notice how every parent is smaller than (or at most equal to) its children and, in turn, all elements in its subtree. Bottom: the same heap in its array representation.

But the improvement we get with trees comes with a price, of course. First, as with any data structure that uses pointers (lists, graphs, trees, and so on) we have a memory overhead in comparison to arrays. While with the latter we just need to reserve space for the data (plus maybe, depending on the implementation details, some constant space for pointers and the node structure itself), every tree node requires extra space for the pointers to its children and possibly to its parent.

Without getting into too much detail, arrays also tend to be better at exploiting memory locality: all the elements in the array are contiguous in memory, and this means lower latency when reading them.

Table 2.3 shows how a heap matches a priority queue for its abstract part, and what its concrete data model and invariants are.
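To make that data model concrete, here is a minimal sketch of an array-based binary max-heap. The class and method names are one possible choice, not the book’s reference implementation, and error handling is omitted for brevity:

```python
class MaxHeap:
    """Array-based binary max-heap: _a[0] always holds the highest priority."""

    def __init__(self):
        self._a = []

    def _bubble_up(self, i):
        # Move the element at i toward the root while it beats its parent.
        while i > 0 and self._a[i] > self._a[(i - 1) // 2]:
            p = (i - 1) // 2
            self._a[i], self._a[p] = self._a[p], self._a[i]
            i = p

    def _push_down(self, i):
        # Move the element at i toward the leaves while a child beats it.
        n = len(self._a)
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self._a[c] > self._a[largest]:
                    largest = c
            if largest == i:
                return
            self._a[i], self._a[largest] = self._a[largest], self._a[i]
            i = largest

    def insert(self, x):
        # Append at the end (keeping the tree complete), then restore invariant 3.
        self._a.append(x)
        self._bubble_up(len(self._a) - 1)

    def top(self):
        # Remove and return the highest-priority element.
        self._a[0], self._a[-1] = self._a[-1], self._a[0]
        result = self._a.pop()
        if self._a:
            self._push_down(0)
        return result
```

Both `insert` and `top` traverse at most one root-to-leaf path, so thanks to the completeness invariant they run in O(log n).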

4. Priority, min-heap, and max-heap

When we stated the heap’s three properties, we used the wording “highest priority.” In a heap, we can always say that the highest-priority element will be at the top, without raising any ambiguity.

When it comes to practice, we will have to define what priority means; but if we implement a general-purpose heap, we can safely parameterize it with a custom priority function, taking an element and returning its priority. As long as our implementation abides by the three invariants stated in section 2.5.3, we can be sure our heap works as expected.

Sometimes, however, it’s better to have specialized implementations that leverage the domain knowledge and avoid the overhead of a custom priority function. For instance, if we store tasks in a heap, we can use tuples (priority, task) as elements, and rely on the natural ordering of tuples.
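Python’s standard library illustrates this nicely: the `heapq` module implements a binary min-heap over a plain list, and storing (priority, task) tuples relies exactly on the natural lexicographic ordering of tuples, so no custom priority function is needed. (Here, being a min-heap, lower numbers come out first.)

```python
import heapq

tasks = []  # a plain list, managed as a min-heap by heapq
heapq.heappush(tasks, (2, "write tests"))
heapq.heappush(tasks, (1, "fix crash"))
heapq.heappush(tasks, (3, "update docs"))

# Tuples compare element by element, so the smallest priority wins:
print(heapq.heappop(tasks))  # (1, 'fix crash')
```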

Either way, we still need to define what highest priority means. If we assume that highest priority means a larger number, that is, if p1 > p2 means p1 has the higher priority, then we call our heap a max-heap.

At other times we need to return the smallest numbers first, so we assume instead that p1 > p2 means p2 has the higher priority. In this case, we are using a min-heap. In the rest of the book, and in particular in the coding section, assume we are implementing a max-heap.

The implementation of a min-heap differs only slightly from a max-heap: the code is pretty much symmetric, and you just need to exchange all occurrences of < and ≤ with > and ≥ respectively, and swap min with max.

Alternatively, if you already have an implementation of one, it’s even easier to get the other by simply negating priorities (either in your priority function or when you pass priority explicitly).

To give you a concrete example, suppose you have a min-heap that you use to store (age, task) pairs: the min-heap will return the task with the smallest age first. If we are instead interested in having the oldest task returned first, we need a max-heap; it turns out we can get the same result without changing any code for our heap! We just need to store elements as tuples (-age, task) instead. If x.age < y.age, in fact, then -x.age > -y.age, and thus the min-heap will return first the tasks whose ages are the largest.

For instance, if we have task A with age 2 (days) and task B with age 3, then we cre­ate the tuples (-2, A) and (-3, B); when extracting them from a min-heap, we can check that (-3, B) < (-2, A) and so the former will be returned before the latter.
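Using Python’s `heapq` (a min-heap) to act out exactly this example, with tasks A and B from above:

```python
import heapq

# The negation trick: get max-heap behavior out of a min-heap by
# storing (-age, task) pairs instead of (age, task).
oldest_first = []
heapq.heappush(oldest_first, (-2, "A"))  # task A, age 2 days
heapq.heappush(oldest_first, (-3, "B"))  # task B, age 3 days

age, task = heapq.heappop(oldest_first)
print(task, -age)  # B 3 — the oldest task comes out first
```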

5. Advanced variant: d-ary heap

One could think that heaps need to be binary trees. After all, binary search trees are the most common kind of tree, intrinsically associated with ordering. It turns out that there is no reason to keep our branching factor[7] fixed and equal to 2. On the contrary, we can use any larger value, and keep the same array representation for the heap.

For a branching factor of 3, the children of the i-th node are at indices 3*i + 1, 3*i + 2, and 3*(i + 1), while the parent of node i is at position (i – 1)/3.
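These formulas generalize directly to any branching factor d (with d = 2 giving back the binary-heap formulas); the helper names below are ours:

```python
# 0-based index arithmetic for a d-ary heap stored in an array.
def children(i, d):
    # The d children of node i occupy a contiguous run of indices.
    return range(d * i + 1, d * (i + 1) + 1)

def parent(i, d):
    return (i - 1) // d     # integer division; only meaningful for i > 0

# For d = 3, node 1's children are at 3*1 + 1, 3*1 + 2, 3*(1 + 1):
assert list(children(1, 3)) == [4, 5, 6]
assert parent(4, 3) == 1
```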

Figure 2.5 shows both the tree and array representation of a 3-ary heap. The same idea holds for branching factors 4, 5, and so on.

For a d-ary heap, where the branching factor is the integer D > 1, our three heap invariants become

  • Every node has at most D children.
  • The heap tree is complete, with all levels left-adjusted. That is, the i-th subtree has a height at most equal to that of the siblings on its left (subtrees 0 to i-1, for 1 ≤ i < D).
  • Every node holds the highest priority in the subtree rooted at that node.

FUN FACT It’s worth noting that for D = 1, the heap becomes a sorted array (or a sorted doubly linked list in its tree representation). Heap construction then amounts to the insertion sort algorithm and requires quadratic time. Every other operation requires at most linear time.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
