R-tree

The first evolution of k-d trees we will discuss is the R-tree. Although we won’t delve into the details of its implementation, we are going to discuss the idea behind this solution, why it works, and its high-level mechanism.

R-trees were introduced in 1984 by Antonin Guttman in the paper “R-Trees: A Dynamic Index Structure for Spatial Searching.”

They are inspired by B-trees, balanced trees with a hierarchical structure. In particular, Guttman used as a starting point B+ trees, a variant where only leaf nodes contain data, while inner nodes only contain keys and serve the purpose of hierarchically partitioning data.

1. A step back: Introducing B-trees

Figure 10.2 shows an example of a B-tree, in particular a B+ tree. These data structures were meant to index unidimensional data, partitioning it into pages, providing efficient storage on disk and fast search (minimizing the number of pages loaded, and thus the number of disk accesses).

Each node (both internal nodes and leaves) in a B-tree contains between d-1 and 2*d-1 keys, where d is a fixed parameter for each tree, its branching factor: the (minimum, in this case) number of children for each node. The only exception is the root, which can possibly contain fewer than d-1 keys. Keys are stored in an ordered list; this is fundamental to having fast (logarithmic) search. In fact, each internal node with m keys k₀, k₁, …, kₘ₋₁, where d-1 ≤ m ≤ 2*d-1, also has exactly m+1 children C₀, C₁, …, Cₘ₋₁, Cₘ, such that k < k₀ for each key k in the subtree rooted at C₀; k₀ < k < k₁ for each key k in the subtree rooted at C₁; and so on.
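To make this ordering invariant concrete, here is a minimal Python sketch of how a search descends a B+ tree by exploiting the sorted keys in each node. It is our own illustration, with hypothetical names (BPlusNode, search), not the book’s code:

    import bisect

    class BPlusNode:
        """Hypothetical B+ tree node: 'keys' is kept sorted; internal
        nodes have exactly len(keys) + 1 children, leaves have none."""
        def __init__(self, keys, children=None, items=None):
            self.keys = keys          # k0 < k1 < ... < k(m-1)
            self.children = children  # m + 1 children, or None for a leaf
            self.items = items        # links to the items, leaves only

    def search(node, key):
        # Descend: keys below k0 go to C0, keys between k0 and k1 to C1, etc.
        while node.children is not None:
            i = bisect.bisect_right(node.keys, key)  # O(log m) within the node
            node = node.children[i]
        return key in node.keys  # finally, scan the leaf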

In a B-tree, keys and items are stored in the nodes, each key/item is stored exactly once in a single node, and the whole tree stores exactly n keys if the dataset has n items. In a B+ tree, internal nodes only contain keys, and only leaves contain pairs with keys and links to the items. This means that a B+ tree storing n items has n leaves, and that keys in internal nodes are also stored in their descendants (see how, in the example in figure 10.2, the keys 4, 10, and 45 are also stored in leaves).

Storing links to items in the leaves, instead of having the actual items hosted in the tree, serves a double purpose:

  • Nodes are more lightweight and easier to allocate/garbage collect.
  • It allows storing all items in an array, or in a contiguous block of memory, exploiting the memory locality of neighboring items.

When these trees are used to store huge collections of large items, these properties allow us to use memory paging efficiently. By having lightweight nodes, it is more likely the whole tree will fit in memory, while items can be stored on disk and leaves can be loaded on an as-needed basis. Because it is also likely that after accessing an item X, applications will need to access one of its neighboring items, loading in memory the whole B-tree leaf containing X reduces the number of disk reads as much as possible.

Not surprisingly, for these reasons B-trees have been at the core of many SQL database engines since their invention, and even today they are still the data structure of choice for storing indices.

2. From B-trees to R-trees

R-trees extend the main ideas behind B+ trees to the multidimensional case. While for unidimensional data each node corresponds to an interval (the range between the leftmost and rightmost keys in its subtree, which are in turn its minimum and maximum keys), in R-trees each node n covers a rectangle (or a hyper-rectangle in the most generic case), whose corners are defined by the minimum and maximum of each coordinate over all the points in the subtree rooted at n.

Similarly to B-trees, R-trees are also parametric. Instead of a branching factor d controlling the minimum number of entries per node, R-trees require their clients to provide two parameters on creation:

  • M, the maximum number of entries in a node; this value is usually set so that a full node will fit in a page of memory.
  • m, such that m ≤ M/2, the minimum number of entries in a node. This parameter indirectly controls the maximum height of the tree, as we’ll see.

Given values for these two parameters, R-trees abide by a few invariants:

  1. Every leaf contains between m and M points (except for the root, which can possibly have fewer than m points).
  2. Each leaf node L has associated a hyper-rectangle RL, such that RL is the smallest rectangle containing all the points in the leaf.
  3. Every internal node has between m and M children (except for the root, which can possibly have fewer than m children).
  4. Each internal node N has associated a bounding (hyper-)rectangle RN, such that RN is the smallest rectangle, whose edges are parallel to the Cartesian axes, entirely containing all the bounding rectangles of N’s children.
  5. The root node has at least two children, unless it is a leaf.
  6. All leaves are at the same level.
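To make these invariants concrete, here is a minimal, hypothetical Python sketch of an R-tree node; the names (RTreeNode, entries, bbox) are our own, and real implementations track considerably more bookkeeping:

    class RTreeNode:
        """Hypothetical R-tree node: leaves hold between m and M points,
        internal nodes between m and M children (the root is exempt)."""
        def __init__(self, m, M, leaf=True):
            assert 1 <= m <= M // 2       # invariant: m <= M/2
            self.m, self.M = m, M
            self.leaf = leaf
            self.entries = []             # points (leaf) or child nodes (internal)
            self.bbox = None              # smallest enclosing (hyper-)rectangle

        def overflowing(self):
            return len(self.entries) > self.M  # insertion must trigger a split

        def underflowing(self):
            return len(self.entries) < self.m  # removal must trigger a merge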

Property number 6 tells us that R-trees are balanced, while from properties 1 and 3 we can infer that the maximum height of an R-tree containing n points is logₘ(n). For example, with m = 100, an R-tree holding 1,000,000 points is at most log₁₀₀(1,000,000) = 3 levels deep.

On insertion, if any node on the path from the root to the leaf holding the new point ends up with more than M entries, we will have to split it, creating two nodes, each with half the elements.

On removal, if any node becomes smaller than m entries, we will have to merge it with one of its adjacent siblings.

Invariants 2 and 4 require some extra work to maintain, but the bounding rectangles defined for each node are needed to allow fast search on the tree.

Before describing how the search methods work, let’s take a closer look at an example of an R-tree in figures 10.3 and 10.4. We will stick to the 2-D case because it is easier to visualize, but as always, you have to imagine that real trees can hold 3-D, 4-D, or even 100-D points.

If we compare figure 10.3 to figure 9.6, showing how a k-d tree organizes the same dataset, it is immediately apparent how the two partitionings are completely different:

  • R-trees create regions in the Cartesian plane in the shape of rectangles, while k-d trees split the plane along lines.
  • While k-d trees alternate the dimension along which the split is done, R-trees don’t cycle through dimensions. Rather, at each level the sub-rectangles created can partition their bounding box in any or even all dimensions at the same time.
  • The bounding rectangles can overlap, both across different sub-trees and even with siblings’ bounding boxes sharing the same parent. However, and this is crucial, no sub-rectangle extends outside its parent’s bounding box.
  • Each internal node defines a so-called bounding envelope, which for R-trees is the smallest rectangle containing all the bounding envelopes of the node’s children.

Figure 10.4 shows how these properties translate into a tree data structure; here the difference with k-d trees is even more evident!

Each internal node is a list of rectangles (between m and M of them, as mentioned), while leaves are lists of (again, between m and M) points. Each rectangle is effectively determined by its children and could indeed be defined recursively in terms of its children. For practical reasons, such as improving the running time of the search methods, in practice we store the bounding box for each rectangle.

Because the rectangles can only be parallel to the Cartesian axes, they are defined by two of their vertices: two tuples with k coordinates, one tuple for the minimum values of each coordinate, and one for the maximum values of each coordinate.
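In code, such a rectangle could be modeled as follows; this is a sketch under the same assumptions as before (Rect, contains, and merge are our own hypothetical names):

    class Rect:
        """Axis-aligned hyper-rectangle: two k-dimensional tuples holding,
        respectively, the minimum and maximum value of each coordinate."""
        def __init__(self, low, high):
            assert all(l <= h for l, h in zip(low, high))
            self.low, self.high = tuple(low), tuple(high)

        def contains(self, point):
            # A point is inside iff each coordinate falls within [low_i, high_i]
            return all(l <= p <= h
                       for l, p, h in zip(self.low, point, self.high))

        def merge(self, other):
            # Smallest rectangle enclosing both: min of lows, max of highs
            return Rect(tuple(map(min, self.low, other.low)),
                        tuple(map(max, self.high, other.high)))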

Notice how, unlike k-d trees, an R-tree can handle non-zero-measure objects by simply considering their bounding boxes as special cases of rectangles, as illustrated in figure 10.5.

3. Inserting points in an R-tree

Now, of course, you might legitimately wonder how you get from a raw dataset to the R-tree in figure 10.5. After all, we just presented it and asked you to take it as a given.

Insertion for R-trees is similar to B-trees and has many steps in common with SS-trees, so we won’t duplicate the learning effort with a detailed description here.

At a high level, to insert a new point you need to follow these steps (a code sketch follows the list):

  1. Find the leaf that should host the new point P. There are three possible cases:
    • P lies exactly within one of the leaves’ rectangles, R. Then just add P to R and move to the next step.
    • P lies within the overlapping region between two or more leaves’ bounding rectangles. For example, referring to figure 10.6, it might lie in the intersection of R12 and R14. In this case, we need to decide where to add P; the heuristic used to make these decisions will determine the shape of the tree (as an example, one heuristic could be just adding it to the rectangle with fewer elements).
    • If P lies outside of all rectangles at the leaves’ level, then we need to find the closest leaf L and add P to it (again, we can use more complex heuristics than just the Euclidean distance to decide).
  2. Add the point to the leaf’s rectangle R, and check how many points it contains afterward:
    • If, after the new point is added, the leaf still has at most M points, then we are done.
    • Otherwise, we need to split R into two new rectangles, R1 and R2, and go to step 3.
  3. Remove R from its parent RP and add R1 and R2 to RP. If RP now has more than M children, split it and repeat this step recursively.
    • If R was the root, we obviously can’t remove it from its parent; we just create a new root and set R1 and R2 as its children.
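As a rough sketch, the steps above might translate into code like this; choose_leaf, split, make_root, and replace_child are hypothetical placeholders for the heuristics discussed next, not Guttman’s actual algorithm:

    def insert(tree, P):
        leaf = choose_leaf(tree.root, P)    # step 1: containing, overlapping, or closest leaf
        leaf.entries.append(P)              # step 2: add P to the chosen leaf
        node = leaf
        while len(node.entries) > node.M:   # step 3: split overflowing nodes, bottom-up
            n1, n2 = split(node)            # heuristic: partition entries into two nodes
            if node.parent is None:         # splitting the root grows the tree by one level
                tree.root = make_root(n1, n2)
                return
            node.parent.replace_child(node, n1, n2)
            node = node.parent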

To complete the insertion algorithm outlined here, we need to provide a few heuristics to break ties for overlapping rectangles and to choose the closest rectangle, but even more importantly, we haven’t said anything about how we are going to split a rectangle at steps 2 and 3.

This choice, together with the heuristic for choosing the insertion subtree, determines the behavior and shape (not to mention performance) of the R-tree.

Several heuristics have been studied over the years, each one aiming to optimize one or more usages of the tree. The split heuristics can be particularly complicated for internal nodes, because we don’t just partition points, but k-dimensional shapes. Figure 10.7 shows an example of how a poor choice can lead to inefficient splits.

Delving into these heuristics is outside the scope of this section; we refer the curious reader to the original paper by Antonin Guttman for a proper description. At this point, though, we can already reveal that the complexity of handling hyper-rectangles and obtaining good splits (and merges, after removals) is one of the main reasons that led to the introduction of SS-trees.

4. Search

Searching for a point or for the nearest neighbor (NN) of a point in R-trees is very similar to what happens in k-d trees. We need to traverse the tree, pruning branches that can’t contain the point or, for NN search, that are certainly further away than the current minimum distance.

Figure 10.8 shows an example of an (unsuccessful) point search on our example R-tree. Remember that an unsuccessful search is the first step for inserting a new point, through which we can find the rectangle (or rectangles, in this case) where we should add the new point.

The search starts at the root, where we compare point P’s coordinates with the boundaries of each rectangle, R1 and R2; P can only be within R2, so this is the only branch we traverse.

At the next step, we go through R2’s children, R5 and R6. Both can contain P, so we need to traverse both branches at this level (as shown by the two curved arrows leaving R2 in the bottom half of figure 10.8).

This means we need to go through the children of both rectangles R5 and R6, checking from R11 to R14. Of these, only R12 and R14 can contain P, so those are the only rectangles whose points we will check at the last step. Neither contains P, so the search method can return false, and optionally the two leaves’ rectangles that could host P, if it were inserted.
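The traversal just described could be sketched as follows, reusing the hypothetical RTreeNode and Rect classes from earlier; note how, unlike in a k-d tree, several branches may be explored at each level:

    def point_search(node, P, candidate_leaves):
        """Return True if P is stored in this subtree; also collect the
        leaves whose rectangle contains P (useful for a later insertion)."""
        if node.leaf:
            candidate_leaves.append(node)
            return any(point == P for point in node.entries)
        found = False
        for child in node.entries:
            if child.bbox.contains(P):  # prune branches that can't contain P
                found = point_search(child, P, candidate_leaves) or found
        return found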

Nearest neighbor search works similarly, but instead of checking whether a point belongs to each rectangle, it keeps the distance of the current nearest neighbor and checks whether each rectangle is closer than that (otherwise, it can be pruned). This is similar to the rectangular region search in k-d trees, as described in section 9.3.6.
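That pruning test boils down to the minimum distance between the query point and a rectangle; one common way to compute it (again a sketch, not taken from the book) is:

    def min_dist_sq(rect, P):
        """Squared distance from P to the closest point of rect (zero if P
        is inside): a branch whose rectangle is at least as far away as the
        current nearest neighbor can be pruned."""
        total = 0.0
        for p, low, high in zip(P, rect.low, rect.high):
            d = max(low - p, 0.0, p - high)  # distance to [low, high] on this axis
            total += d * d
        return total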

We won’t delve further into NN-search for R-trees. Now that you should have a high-level understanding of this data structure, we are ready to move on to its evolution, the SS-tree.

It’s also worth mentioning that R-trees do not guarantee good worst-case performance, but in practice they usually perform better than k-d trees, so for a long time they were the de facto standard for similarity search and indexing of multidimensional datasets.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
