R-tree

The first evolution of k-d trees we will discuss is the R-tree. Although we won’t delve into the details of its implementation, we are going to discuss the idea behind this solution, why it works, and its high-level mechanism.

R-trees were introduced in 1984 by Antonin Guttman in the paper “R-Trees: A Dynamic Index Structure for Spatial Searching.”

They are inspired by B-trees, balanced trees with a hierarchical structure. In particular, Guttman used as a starting point B+ trees, a variant where only leaf nodes contain data, while inner nodes only contain keys and serve the purpose of hierarchically partitioning data.

1. A step back: Introducing B-trees

Figure 10.2 shows an example of a B-tree, in particular a B+ tree. These data structures were meant to index unidimensional data, partitioning it into pages, providing efficient storage on disk and fast search (minimizing the number of pages loaded, and thus the number of disk accesses).

Each node (both internal nodes and leaves) in a B-tree contains between d-1 and 2*d-1 keys, where d is a fixed parameter for each tree, its branching factor: the (minimum, in this case) number of children for each node. The only exception is the root, which can possibly contain fewer than d-1 keys. Keys are stored in an ordered list; this is fundamental to having fast (logarithmic) search. In fact, each internal node with m keys k₀, k₁, …, kₘ₋₁, where d-1 ≤ m ≤ 2*d-1, also has exactly m+1 children C₀, C₁, …, Cₘ₋₁, Cₘ, such that k < k₀ for each key k in the subtree rooted at C₀; k₀ < k < k₁ for each key k in the subtree rooted at C₁; and so on.
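To make this ordering invariant concrete, here is a minimal Python sketch of how a search descends a B+ tree by exploiting the sorted keys in each node. It is our own illustration, with hypothetical names (BPlusNode, search), not the book’s code:

    import bisect

    class BPlusNode:
        """Hypothetical B+ tree node: 'keys' is kept sorted; internal
        nodes have exactly len(keys) + 1 children, leaves have none."""
        def __init__(self, keys, children=None, items=None):
            self.keys = keys          # k0 < k1 < ... < k(m-1)
            self.children = children  # m + 1 children, or None for a leaf
            self.items = items        # links to the items, leaves only

    def search(node, key):
        # Descend: keys below k0 go to C0, keys between k0 and k1 to C1, etc.
        while node.children is not None:
            i = bisect.bisect_right(node.keys, key)  # O(log m) within the node
            node = node.children[i]
        return key in node.keys  # finally, scan the leaf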

In a B-tree, keys and items are stored in the nodes, each key/item is stored exactly once in a single node, and the whole tree stores exactly n keys if the dataset has n items. In a B+ tree, internal nodes only contain keys, and only leaves contain pairs with keys and links to the items. This means that a B+ tree storing n items has n leaves, and that keys in internal nodes are also stored in their descendants (see how, in the example in figure 10.2, the keys 4, 10, and 45 are also stored in leaves).

Storing links to items in the leaves, instead of having the actual items hosted in the tree, serves a double purpose:

  • Nodes are more lightweight and easier to allocate/garbage collect.
  • It allows storing all items in an array, or in a contiguous block of memory, exploiting the memory locality of neighboring items.

When these trees are used to store huge collections of large items, these properties allow us to use memory paging efficiently. By having lightweight nodes, it is more likely the whole tree will fit in memory, while items can be stored on disk and leaves can be loaded on an as-needed basis. Because it is also likely that after accessing an item X, applications will need to access one of its neighboring items, loading in memory the whole B-tree leaf containing X reduces the number of disk reads as much as possible.

Not surprisingly, for these reasons B-trees have been at the core of many SQL database engines since their invention, and even today they are still the data structure of choice for storing indices.

2. From B-trees to R-trees

R-trees extend the main ideas behind B+ trees to the multidimensional case. While for unidimensional data each node corresponds to an interval (the range between the leftmost and rightmost keys in its subtree, which are in turn its minimum and maximum keys), in R-trees each node n covers a rectangle (or a hyper-rectangle in the most generic case), whose corners are defined by the minimum and maximum of each coordinate over all the points in the subtree rooted at n.

Similarly to B-trees, R-trees are also parametric. Instead of a branching factor d controlling the minimum number of entries per node, R-trees require their clients to provide two parameters on creation:

  • M, the maximum number of entries in a node; this value is usually set so that a full node will fit in a page of memory.
  • m, such that m ≤ M/2, the minimum number of entries in a node. This parameter indirectly controls the maximum height of the tree, as we’ll see.

Given values for these two parameters, R-trees abide by a few invariants:

  1. Every leaf contains between m and M points (except for the root, which can possibly have fewer than m points).
  2. Each leaf node L has associated a hyper-rectangle RL, such that RL is the smallest rectangle containing all the points in the leaf.
  3. Every internal node has between m and M children (except for the root, which can possibly have fewer than m children).
  4. Each internal node N has associated a bounding (hyper-)rectangle RN, such that RN is the smallest rectangle, whose edges are parallel to the Cartesian axes, entirely containing all the bounding rectangles of N’s children.
  5. The root node has at least two children, unless it is a leaf.
  6. All leaves are at the same level.
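To make these invariants concrete, here is a minimal, hypothetical Python sketch of an R-tree node; the names (RTreeNode, entries, bbox) are our own, and real implementations track considerably more bookkeeping:

    class RTreeNode:
        """Hypothetical R-tree node: leaves hold between m and M points,
        internal nodes between m and M children (the root is exempt)."""
        def __init__(self, m, M, leaf=True):
            assert 1 <= m <= M // 2       # invariant: m <= M/2
            self.m, self.M = m, M
            self.leaf = leaf
            self.entries = []             # points (leaf) or child nodes (internal)
            self.bbox = None              # smallest enclosing (hyper-)rectangle

        def overflowing(self):
            return len(self.entries) > self.M  # insertion must trigger a split

        def underflowing(self):
            return len(self.entries) < self.m  # removal must trigger a merge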

Property number 6 tells us that R-trees are balanced, while from properties 1 and 3 we can infer that the maximum height of an R-tree containing n points is logₘ(n). For example, with m = 100, an R-tree holding 1,000,000 points is at most log₁₀₀(1,000,000) = 3 levels deep.

On insertion, if any node on the path from the root to the leaf holding the new point ends up with more than M entries, we will have to split it, creating two nodes, each with half the elements.

On removal, if any node becomes smaller than m entries, we will have to merge it with one of its adjacent siblings.

Invariants 2 and 4 require some extra work to maintain, but the bounding rectangles defined for each node are needed to allow fast search on the tree.

Before describing how the search methods work, let’s take a closer look at an example of an R-tree in figures 10.3 and 10.4. We will stick to the 2-D case because it is easier to visualize, but as always, you have to imagine that real trees can hold 3-D, 4-D, or even 100-D points.

If we compare figure 10.3 to figure 9.6, showing how a k-d tree organizes the same dataset, it is immediately apparent how the two partitionings are completely different:

  • R-trees create regions in the Cartesian plane in the shape of rectangles, while k-d trees split the plane along lines.
  • While k-d trees alternate the dimension along which the split is done, R-trees don’t cycle through dimensions. Rather, at each level the sub-rectangles created can partition their bounding box in any or even all dimensions at the same time.
  • The bounding rectangles can overlap, both across different sub-trees and even with siblings’ bounding boxes sharing the same parent. However, and this is crucial, no sub-rectangle extends outside its parent’s bounding box.
  • Each internal node defines a so-called bounding envelope, which for R-trees is the smallest rectangle containing all the bounding envelopes of the node’s children.

Figure 10.4 shows how these properties translate into a tree data structure; here the difference with k-d trees is even more evident!

Each internal node is a list of rectangles (between m and M of them, as mentioned), while leaves are lists of (again, between m and M) points. Each rectangle is effectively determined by its children and could indeed be defined recursively in terms of its children. For practical reasons, such as improving the running time of the search methods, in practice we store the bounding box for each rectangle.

Because the rectangles can only be parallel to the Cartesian axes, they are defined by two of their vertices: two tuples with k coordinates, one tuple for the minimum values of each coordinate, and one for the maximum values of each coordinate.
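In code, such a rectangle could be modeled as follows; this is a sketch under the same assumptions as before (Rect, contains, and merge are our own hypothetical names):

    class Rect:
        """Axis-aligned hyper-rectangle: two k-dimensional tuples holding,
        respectively, the minimum and maximum value of each coordinate."""
        def __init__(self, low, high):
            assert all(l <= h for l, h in zip(low, high))
            self.low, self.high = tuple(low), tuple(high)

        def contains(self, point):
            # A point is inside iff each coordinate falls within [low_i, high_i]
            return all(l <= p <= h
                       for l, p, h in zip(self.low, point, self.high))

        def merge(self, other):
            # Smallest rectangle enclosing both: min of lows, max of highs
            return Rect(tuple(map(min, self.low, other.low)),
                        tuple(map(max, self.high, other.high)))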

Notice how, unlike k-d trees, an R-tree can handle non-zero-measure objects by simply considering their bounding boxes as special cases of rectangles, as illustrated in figure 10.5.

3. Inserting points in an R-tree

Now, of course, you might legitimately wonder how you get from a raw dataset to the R-tree in figure 10.5. After all, we just presented it and asked you to take it as a given.

Insertion for R-trees is similar to B-trees and has many steps in common with SS-trees, so we won’t duplicate the learning effort with a detailed description here.

At a high level, to insert a new point you need to follow these steps (a code sketch follows the list):

  1. Find the leaf that should host the new point P. There are three possible cases:
    • P lies exactly within one of the leaves’ rectangles, R. Then just add P to R and move to the next step.
    • P lies within the overlapping region between two or more leaves’ bounding rectangles. For example, referring to figure 10.6, it might lie in the intersection of R12 and R14. In this case, we need to decide where to add P; the heuristic used to make these decisions will determine the shape of the tree (as an example, one heuristic could be just adding it to the rectangle with fewer elements).
    • If P lies outside of all rectangles at the leaves’ level, then we need to find the closest leaf L and add P to it (again, we can use more complex heuristics than just the Euclidean distance to decide).
  2. Add the point to the leaf’s rectangle R, and check how many points it contains afterward:
    • If, after the new point is added, the leaf still has at most M points, then we are done.
    • Otherwise, we need to split R into two new rectangles, R1 and R2, and go to step 3.
  3. Remove R from its parent RP and add R1 and R2 to RP. If RP now has more than M children, split it and repeat this step recursively.
    • If R was the root, we obviously can’t remove it from its parent; we just create a new root and set R1 and R2 as its children.
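As a rough sketch, the steps above might translate into code like this; choose_leaf, split, make_root, and replace_child are hypothetical placeholders for the heuristics discussed next, not Guttman’s actual algorithm:

    def insert(tree, P):
        leaf = choose_leaf(tree.root, P)    # step 1: containing, overlapping, or closest leaf
        leaf.entries.append(P)              # step 2: add P to the chosen leaf
        node = leaf
        while len(node.entries) > node.M:   # step 3: split overflowing nodes, bottom-up
            n1, n2 = split(node)            # heuristic: partition entries into two nodes
            if node.parent is None:         # splitting the root grows the tree by one level
                tree.root = make_root(n1, n2)
                return
            node.parent.replace_child(node, n1, n2)
            node = node.parent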

To complete the insertion algorithm outlined here, we need to provide a few heuristics to break ties for overlapping rectangles and to choose the closest rectangle, but even more importantly, we haven’t said anything about how we are going to split a rectangle at steps 2 and 3.

This choice, together with the heuristic for choosing the insertion subtree, determines the behavior and shape (not to mention performance) of the R-tree.

Several heuristics have been studied over the years, each one aiming to optimize one or more usages of the tree. The split heuristics can be particularly complicated for internal nodes, because we don’t just partition points, but k-dimensional shapes. Figure 10.7 shows an example of how a poor choice can lead to inefficient splits.

Delving into these heuristics is outside the scope of this section; we refer the curious reader to the original paper by Antonin Guttman for a proper description. At this point, though, we can already reveal that the complexity of handling hyper-rectangles and obtaining good splits (and merges, after removals) is one of the main reasons that led to the introduction of SS-trees.

4. Search

Searching for a point or for the nearest neighbor (NN) of a point in R-trees is very similar to what happens in k-d trees. We need to traverse the tree, pruning branches that can’t contain the point or, for NN search, that are certainly further away than the current minimum distance.

Figure 10.8 shows an example of an (unsuccessful) point search on our example R-tree. Remember that an unsuccessful search is the first step for inserting a new point, through which we can find the rectangle (or rectangles, in this case) where we should add the new point.

The search starts at the root, where we compare point P’s coordinates with the boundaries of each rectangle, R1 and R2; P can only be within R2, so this is the only branch we traverse.

At the next step, we go through R2’s children, R5 and R6. Both can contain P, so we need to traverse both branches at this level (as shown by the two curved arrows leaving R2 in the bottom half of figure 10.8).

This means we need to go through the children of both rectangles R5 and R6, checking from R11 to R14. Of these, only R12 and R14 can contain P, so those are the only rectangles whose points we will check at the last step. Neither contains P, so the search method can return false, and optionally the two leaves’ rectangles that could host P, if it were inserted.
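The traversal just described could be sketched as follows, reusing the hypothetical RTreeNode and Rect classes from earlier; note how, unlike in a k-d tree, several branches may be explored at each level:

    def point_search(node, P, candidate_leaves):
        """Return True if P is stored in this subtree; also collect the
        leaves whose rectangle contains P (useful for a later insertion)."""
        if node.leaf:
            candidate_leaves.append(node)
            return any(point == P for point in node.entries)
        found = False
        for child in node.entries:
            if child.bbox.contains(P):  # prune branches that can't contain P
                found = point_search(child, P, candidate_leaves) or found
        return found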

Nearest neighbor search works similarly, but instead of checking whether a point belongs to each rectangle, it keeps the distance of the current nearest neighbor and checks whether each rectangle is closer than that (otherwise, it can be pruned). This is similar to the rectangular region search in k-d trees, as described in section 9.3.6.
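That pruning test boils down to the minimum distance between the query point and a rectangle; one common way to compute it (again a sketch, not taken from the book) is:

    def min_dist_sq(rect, P):
        """Squared distance from P to the closest point of rect (zero if P
        is inside): a branch whose rectangle is at least as far away as the
        current nearest neighbor can be pruned."""
        total = 0.0
        for p, low, high in zip(P, rect.low, rect.high):
            d = max(low - p, 0.0, p - high)  # distance to [low, high] on this axis
            total += d * d
        return total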

We won’t delve further into NN-search for R-trees. Now that you should have a high-level understanding of this data structure, we are ready to move on to its evolution, the SS-tree.

It’s also worth mentioning that R-trees do not guarantee good worst-case performance, but in practice they usually perform better than k-d trees, so for a long time they were the de facto standard for similarity search and indexing of multidimensional datasets.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
