Similarity search tree

In section 10.2, we saw some of the key properties that influence the shape and performance of R-trees. Let's recap them here:

  • The splitting heuristic
  • The criteria used to choose the sub-tree where new points are added (when more than one bounding envelope overlaps a point)
  • The distance metric

For R-trees, we assumed that aligned boxes, hyper-rectangles parallel to the Cartesian axes, are used as bounding envelopes for the nodes. If we lift this constraint, the shape of the bounding envelope becomes the fourth property of a more general class of similarity search trees.

And, indeed, at their core, the main difference between R-trees and SS-trees in their most basic versions is the shape of the bounding envelopes. As shown in figure 10.9, this variant (built on R-trees) uses spheres instead of rectangles.

Although it might seem like a small change, there is strong theoretical and practical evidence suggesting that using spheres reduces the average number of leaves touched by a similarity (nearest neighbor or region) query. We will discuss this point in more depth in section 10.5.1.

Each internal node N is therefore a sphere with a center and a radius. Those two properties are uniquely and completely determined by N’s children. N’s center is, in fact, the centroid of N’s children, and the radius is the maximum distance between the centroid and N’s points.

To be fair, when we said that the only difference between R-trees and SS-trees was the shape of the bounding envelopes, we were guilty of omission. The choice of a different shape for the bounding envelopes also forces us to adopt a different splitting heuristic. In the case of SS-trees, instead of trying to reduce the spheres' overlap on split, we aim to reduce the variance of each of the newly created nodes; the original splitting heuristic therefore chooses the dimension with the highest variance and then splits the sorted list of children so as to reduce variance along that dimension (we'll see this in more detail in the discussion about insertion in section 10.3.2).

As for R-trees, SS-trees have two parameters, m and M, respectively the minimum and maximum number of children each node (except the root) is allowed to have.

And like R-trees, bounding envelopes in an SS-tree might overlap. To reduce the overlap, some variants such as SS+-trees introduce a fifth property (also used in R-tree variants like R*-trees): another heuristic, used on insert, that performs major changes to restructure the tree. We will talk about SS+-trees later in this chapter, but for now we will focus on the implementation of plain SS-trees.

The first step toward a pseudo-implementation for our data structures is, as always, presenting a pseudo-class that models it. In this case, to model an SS-tree we are going to need a class modeling tree nodes. Once we build an SS-tree through its nodes, to access it we just need a pointer to the tree's root. For convenience, as shown in listing 10.1, we will include this link and the values for parameters m and M in the SsTree class, as well as the dimensionality k of each data entry, and assume all these values are available from each of the tree nodes.

As we have seen, SS-trees (like R-trees) have two different kinds of nodes, leaves and internal nodes, that are structurally and behaviorally different. The former store k-dimensional tuples (references to the points in our dataset), while the latter only hold links to their children (which are also nodes of the tree).

To keep things simple and as language-agnostic as possible, we will store both an array of children and an array of points in each node, with a Boolean flag to tell leaves apart from internal nodes. The children array will be empty for leaves, and the points array will be empty for internal nodes.

Listing 10.1 The SsTree and SsNode classes

class SsNode
    #type tuple(k) centroid
    #type float radius
    #type SsNode[] children
    #type tuple(k)[] points
    #type boolean leaf

    function SsNode(leaf, points=[], children=[])

class SsTree
    #type SsNode root
    #type integer m
    #type integer M
    #type integer k

    function SsTree(k, m, M)

Notice how in figure 10.9 we represented our tree nodes as a list of spheres, each of which has a link to a child. We could, of course, add a type SsSphere and keep a link to each sphere's only child node as a field of this new type. It wouldn't make a great design, though, and would lead to data duplication (because then both SsNode and SsSphere would hold fields for centroid and radius) and create an unnecessary level of indirection. Just keep in mind that when you look at the diagrams of SS-trees in these pages, what are shown as components of a tree node are actually its children.

One effective alternative for translating this into code in object-oriented programming is to use inheritance: define a common abstract class (a class that can't be instantiated to an actual object) or an interface, and two derived classes (one for leaves and one for internal nodes) that share common data and behavior (defined in the base, abstract class) but are implemented differently. Listing 10.2 shows a possible pseudo-code description of this pattern.

Listing 10.2 Alternative Implementation for SsNode: SsNodeOO

abstract class SsNodeOO
    #type tuple(k) centroid
    #type float radius

class SsInnerNode: SsNodeOO
    #type SsNode[] children

    function SsInnerNode(children=[])

class SsLeaf: SsNodeOO
    #type tuple(k)[] points

    function SsLeaf(points=[])

Although the implementation using inheritance might result in some code duplication and require greater effort to understand the code, it arguably provides a cleaner solution, removing the logic to choose the type of node that would otherwise be needed in each method of the class.

Although we won’t adopt this example in the rest of the chapter, the zealous reader might use it as a starting point to experiment with implementing SS-trees using this pattern.

1. SS-tree search

Now we are ready to start describing SsNode's methods. Although it would feel natural to start with insertion (we need to build a tree before searching it, after all), it is also true that, as for many tree-based data structures, the first step to insert (or delete) an entry is searching for the node where it should be inserted.

Hence, we will need the search method (meant as exact element search) before we can insert a new item. While we will see that this step in the insert method is slightly different from plain search, it will still be easier to describe insertion after we have discussed traversing the tree.

Figures 10.10 and 10.11 show the steps of a call to search on our example SS-tree. To be fair, the SS-tree we'll use in the rest of the chapter is derived from the one in figure 10.9. You might notice that there are a few more points (the orange stars), a few of the old points have been slightly moved, and we stripped all the points' labels, replacing them with letters from A to W, in order to remove clutter and have cleaner diagrams. For the same reason, in this and the following sections we'll call the point to search/insert/delete Z (to avoid clashes with points already in the tree).

To continue with our image dataset example, suppose that we now would like to check whether a specific image Z is in our dataset. One option would be comparing Z to all the images in the dataset. Comparing two images might take some time (especially if, for instance, all images have the same size and we can't do a quick check on some other trivial image property to rule out obviously different pairs). Recalling that our dataset supposedly has tens of thousands of images, if we go this way we should be prepared to take a long coffee break (or, depending on our hardware, leave our machine working for the night).

But, of course, by now readers must have learned that we shouldn’t despair, because this is the time we provide a better alternative!

And indeed, as we mentioned at the beginning of the chapter, we can create a collection of feature vectors for the images in our dataset, extract the feature vector for Z—let's call it FZ—and perform a search in the feature vector space instead of directly searching the image dataset.

Now, comparing FZ to tens or hundreds of thousands of other vectors could also be slow and expensive in terms of time, memory, and disk accesses.

If each memory page stored on disk can hold M feature vectors, we would have to perform n/M disk accesses and read n*k float values from disk.

And that's exactly where an SS-tree comes into play. By using an SS-tree with at most M entries per node, and at least m ≤ M/2, we can reduce the number of pages loaded from disk to ~2*logM(n), and the number of float values read to ~k*M*logM(n).
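To get a feel for the difference, we can plug in some purely illustrative numbers (n and M below are assumptions for the sake of the example, not values from the text):

    import math

    n = 500_000   # feature vectors in the dataset (illustrative)
    M = 50        # feature vectors per disk page / max entries per node

    linear_scan_pages = n / M             # n/M page reads for a full scan
    ss_tree_pages = 2 * math.log(n, M)    # ~2*log_M(n) page reads with an SS-tree

    print(f"linear scan: {linear_scan_pages:,.0f} pages")  # 10,000 pages
    print(f"SS-tree:     {ss_tree_pages:.1f} pages")       # about 6.7 pages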

Listing 10.3 shows the pseudo-code for SS-tree's search method. We can follow the steps from figures 10.10 and 10.11. Initially node will be the root of our example tree, so not a leaf; we'll then go directly to line #7 and start cycling through node's children, in this case S1 and S2.

For each of them, we compute the distance between target (point Z in the figure) and the sphere's centroid, as shown in listing 10.4, which describes the pseudo-code implementation of method SsNode::intersectsPoint. Since for both spheres the computed (Euclidean) distance is smaller than their radii, either (or both) could contain our target point, and therefore we need to traverse both the S1 and S2 branches.

This is also apparent in figure 10.10, where point Z clearly lies in the intersection of spheres S1 and S2.

The next couple of steps in figures 10.10 (bottom half) and 10.11 execute the same lines of code, cycling through node's children until we get to a leaf. It's worth noting that this implementation performs a depth-first traversal of the tree: it follows each branch all the way down to its leaves, getting to leaves as fast as possible and backtracking when needed. For the sake of space, these figures show the paths as if they were traversed in parallel, which is entirely possible with some modifications to the code (that would, however, depend on the programming language of an actual implementation, so we will stick with the simpler and less resource-intensive sequential version).

The method will sometimes traverse branches where none of the children might contain the target. That's the case, for instance, with the node containing S3 and S4. The execution will just end up at line #12 of listing 10.3, returning null and backtracking to the caller. It had initially traversed branch S4; now the for-each loop at line #7 will just move on to branch S2.

When we finally get to leaves S12-S14, the execution will run the cycle at line #3, where we scan a leaf’s points searching for an exact match. If we find one, we can return the current leaf as the result of the search (we assume the tree doesn’t contain duplicates, of course).

Listing 10.4 shows a simple implementation for the method checking whether a point is within a node's bounding envelope. As you can see, the implementation is very simple, because it just uses some basic geometry. Notice, however, that the distance function is a structural parameter of the SS-tree; it can be the Euclidean distance in a k-dimensional space, but it can also be a different metric.[3]
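To make the traversal concrete, here is a minimal Python sketch of the logic of listings 10.3 and 10.4 (not the book's pseudo-code verbatim). It assumes a node object carrying the fields of listing 10.1 (leaf, points, children, centroid, radius); all names are illustrative:

    import math

    def euclidean(p, q):
        # Default metric; recall that the distance function is a parameter of the tree.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def intersects_point(node, target, distance=euclidean):
        # A point can lie inside the bounding envelope only if its distance
        # from the centroid doesn't exceed the sphere's radius.
        return distance(node.centroid, target) <= node.radius

    def search(node, target):
        if node.leaf:
            # Scan the leaf's points, looking for an exact match.
            return node if target in node.points else None
        for child in node.children:
            # Depth-first: follow every child whose sphere may contain target.
            if intersects_point(child, target):
                result = search(child, target)
                if result is not None:
                    return result
        return None   # no branch contained the target: backtrack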

2. Insert

As mentioned, insertion starts with a search step. While for more basic trees, such as binary search trees, an unsuccessful search returns the one and only node where the new item can be added, for SS-trees we have the same issue we briefly discussed in section 10.2 for R-trees: since nodes can and do overlap, there could be more than one leaf where the new point could be added.

This is such a big deal that we mentioned it as the second property determining the SS-tree’s shape. We need to choose a heuristic method to select which branch to traverse, or to select one of the leaves that would already contain the new point.

SS-trees originally used a simple heuristic: at each step, select the branch whose centroid is closest to the point being inserted (the rare ties can be broken arbitrarily).

This is not always ideal, because it might lead to a situation like the one shown in figure 10.12, where a new point Z could be added to a leaf already covering it, but instead ends up in another leaf whose envelope becomes larger to accept Z and ends up overlapping the first leaf. It is also possible, although unlikely, that the leaf selected is not actually the closest one to the target. Since at each level we traverse only the closest node, if the tree is not well balanced, the method might at some point during traversal bump into a skewed sphere, with the center of mass far away from a small leaf—something like S6 in figure 10.12, whose child S14 lies far away from its center of mass.

On the other hand, using this heuristic greatly simplifies the code and improves the running time. This way, we have a worst-case bound (for this operation) of O(logm(n)), because we only follow one path from the root to a leaf. If we were to traverse all branches intersecting Z, in the worst case we could be forced to visit all leaves.

Moreover, the code would also become more complicated because we might have to handle differently the cases where no leaf, exactly one leaf, or more than one leaf intersecting Z are found.

So, we will use here the original heuristic described in the first SS-tree paper, shown in listing 10.5. It can be considered a simpler version of the search method described in section 10.3.1, since it will only traverse a single path in the tree. Figure 10.13 shows how it differs from a call to the search method for the same tree and point (refer to figures 10.10 and 10.11 for a comparison).
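In the same spirit as the earlier sketch, a compact Python version of this single-path traversal (cf. listing 10.5) could look like the following; again, names and fields are illustrative assumptions:

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def search_parent_leaf(node, target):
        # Unlike search, we never backtrack: at each level we commit to the
        # single child whose centroid is closest to the target.
        if node.leaf:
            return node
        closest = min(node.children, key=lambda c: euclidean(c.centroid, target))
        return search_parent_leaf(closest, target)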

However, listing 10.5 is just meant to illustrate how this traversal works. In the actual insert method, we won't call it as a separate step, but rather integrate it. That's because finding the closest leaf is just the first step; we are far from being done with insertion yet, and we might need to backtrack our steps. That's why we implement insert as a recursive function: each time a sub-call returns, we backtrack along the path from the root to the current node.

Suppose, in fact, that we have found that we should add Z to some leaf L that already contains j points. We know that j ≥ m ≥ 1, so the leaf is not empty, but there could be three very different situations:

  • If L already contains Z, we don't do anything, assuming we don't support duplicates (otherwise, we can refer to the remaining two cases).
  • j < M—In this case, we add Z to the list of L's points, recompute the centroid and radius for L, and we are done. This case is shown in figure 10.12, where L==S14. On the left side of the figure, you can see how the centroid and radius of the bounding envelopes are updated as a result of adding Z to S14.
  • j == M—This is the most complicated case, because if we add another point to L, it will violate the invariant requiring that a leaf hold no more than M points. The only way to solve this is by splitting the leaf's points into two sets and creating two new leaves that will be added to L's parent, N. Unfortunately, by doing this we can end up in the same situation as if N already had M children. Again, the only way we can cope with this is by splitting N's children into two sets (defining two spheres), removing N from its parent P, and adding the two new spheres to P. Obviously, P could now also have M+1 children! Long story short, we need to backtrack toward the root, and we can only stop when we get to a node that has fewer than M children, or when we reach the root. If we have to split the root, then we will create a new root with just two children, and the height of the tree will grow by 1 (and that's the only case where this can happen).

Listing 10.6 shows an implementation of the insert method using the cases just described (a condensed Python sketch follows the list below):

  • The tree traversal, equivalent to the searchParentLeaf method, appears at lines #10 and #11.
  • Case 1 is handled at line #3, where we return null to let the caller know there is no further action required.
  • Case 2 corresponds to lines #6 and #18 in the pseudo-code, also resulting in the method returning null.
  • Case 3, which clearly is the most complicated option, is coded in lines #19 and #20.
  • Backtracking is handled at lines #12 to #21.
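Here is a condensed Python sketch of that logic (the line numbers in the bullets above refer to the book's listing 10.6, not to this code). The helpers find_closest_child, update_bounding_envelope, and split_node are sketched in the following subsections:

    def insert(node, point, m, M):
        if node.leaf:
            if point in node.points:
                return None                      # case 1: duplicate, nothing to do
            node.points.append(point)
            update_bounding_envelope(node)
            if len(node.points) <= M:
                return None                      # case 2: the leaf can host the point
            return split_node(node, m)           # case 3: overflow, split the leaf
        child = find_closest_child(node, point)  # single-path traversal
        new_nodes = insert(child, point, m, M)
        if new_nodes is None:
            update_bounding_envelope(node)       # backtracking: refresh the envelope
            return None
        node.children.remove(child)              # replace the split child...
        node.children.extend(new_nodes)          # ...with the two nodes it produced
        update_bounding_envelope(node)
        return split_node(node, m) if len(node.children) > M else None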

Figures 10.14 and 10.15 illustrate the third case, where we insert a point into a leaf that already contains M points. At a high level, insertion in SS-trees follows B-trees' insert algorithm. The only difference is in the way we split nodes (in B-trees the list of elements is just split into two halves). Of course, in B-trees links to children and ordering are also handled differently, as we saw in section 10.2.

In listing 10.6 we used several helper functions[4] to perform insertion; however, there is still one case that is not handled. What happens when we get to the root and we need to split it?

The reason for not handling this case as part of the method in listing 10.6 is that we would need to update the root of the tree, and this is an operation that needs to be performed on the tree’s class, where we do have access to the root.

Therefore, we will give an explicit implementation of the tree's method for insert. Remember, we actually only expose methods defined on the data structure classes (KdTree, SsTree, and so on) and not on the nodes' classes (such as SsNode), but we usually omit the former when they are just wrappers around the nodes' methods. Look at listing 10.7 to check out how we can handle root splits. Also, let me highlight this again: this code snippet is the only point where our tree's height grows.
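A sketch of how the tree-level wrapper might look (cf. listing 10.7); make_leaf_node and make_internal_node are hypothetical factory functions for the two kinds of nodes, and insert is the node-level function sketched above:

    class SsTree:
        def __init__(self, k, m, M):
            self.k, self.m, self.M = k, m, M
            self.root = make_leaf_node(points=[])    # start with an empty leaf

        def insert(self, point):
            new_nodes = insert(self.root, point, self.m, self.M)
            if new_nodes is not None:
                # The old root was split: create a new root over the two halves.
                # This is the only place where the tree's height can grow.
                self.root = make_internal_node(children=list(new_nodes))
                update_bounding_envelope(self.root)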

 

3. Insertion: Variance, means, and projections

Now let's get into the details of the (many) helper methods we call in listing 10.6, starting with the heuristic method, described in listing 10.8, that finds a node's closest child to a point Z. As mentioned, we just cycle through a node's children, compute the distance between their centroids and Z, and choose the child whose bounding envelope minimizes it.
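In Python, this heuristic boils down to a one-line minimum search (a sketch, reusing the euclidean function from the earlier search sketch as the default metric):

    def find_closest_child(node, target, distance=euclidean):
        # Only internal nodes have children to choose from.
        assert not node.leaf
        return min(node.children, key=lambda c: distance(c.centroid, target))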

Figure 10.16 shows what happens when we need to split a leaf. First we recompute the radius and centroid of the leaf after including the new point; then we compute the variance of the M+1 points' coordinates along each of the k axis directions, in order to find the direction with the highest variance. This is particularly useful with skewed sets of points, like S9 in the example, and helps to reduce the spheres' volume and, in turn, their overlap.

If you refer to figure 10.16, you can see how a split along the x axis would have produced two sets with points G and H on one side, and F and Z on the other. Comparing this result with figure 10.14, there is no doubt about which is the better final result!

Of course, the outcome is not always so neat. If the direction of maximum variance is rotated at some angle with respect to the x axis (imagine, for instance, the same points rotated 45° clockwise with respect to the leaf's centroid), then neither axis direction will produce the optimal result. On average, however, this simpler solution does help.

So, how do we perform the split? We start with listing 10.9, which describes the method to find the direction with maximum variance. It's a simple method performing a global maximum search in a linear space.

We need, of course, to compute the variance at each step of the for loop at line #5. Perhaps this is the right time to remind you what variance is and how it is computed. Given a set S of real values, we define its mean μ as the ratio between the sum of the values and their multiplicity:

$$\mu = \frac{1}{|S|}\sum_{x \in S} x$$

Once we've defined the mean, we can then define the variance (usually denoted as σ²) as the mean of the squares of the differences between S's mean and each of its elements:

$$\sigma^2 = \frac{1}{|S|}\sum_{x \in S} \left(x - \mu\right)^2$$

So, given a set of n points P0..Pn-1, each Pj with coordinates (P(j,0), ..., P(j,k-1)), the formulas for mean and variance along the direction of the i-th axis are

$$\mu_i = \frac{1}{n}\sum_{j=0}^{n-1} P_{j,i} \qquad\qquad \sigma_i^2 = \frac{1}{n}\sum_{j=0}^{n-1} \left(P_{j,i} - \mu_i\right)^2$$
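Since listing 10.9 essentially performs a linear scan for a maximum, its Python counterpart is short. A sketch, with illustrative names (statistics.pvariance from the standard library computes exactly the population variance σ² defined above):

    import statistics

    def direction_of_max_variance(entry_centroids, k):
        # entry_centroids: a leaf's points, or the centroids of an internal
        # node's children. Returns the axis index with the highest variance.
        return max(range(k),
                   key=lambda i: statistics.pvariance([c[i] for c in entry_centroids]))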

These formulas are easily translatable into code, and in most programming languages you will find an implementation of the method computing variance in core libraries; therefore, we won’t show the pseudo-code here. Instead, let’s see how both functions for variance and mean are used in the updateBoundingEnvelope method (listing 10.10) that computes a node centroid and radius.

This method computes the centroid for a node as the center of mass of its children's centroids. Remember, for leaves the children are just the points they contain, while for internal nodes the children are other nodes.

The center of mass is a k-dimensional point, each of whose coordinates is the mean of the corresponding coordinates of all the children's centroids.

Once we have the new centroid, we need to update the radius of the node's bounding envelope. This is defined as the minimum radius for which the bounding envelope includes all the bounding envelopes of the current node's children; in turn, we can define it as the maximum distance between the current node's centroid and any point in its children. Figure 10.17 shows how and why these distances are computed for each child: it's the sum of the distance between the two centroids and the child's radius (as long as we assume that points have radius==0, this definition also works for leaves).
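A Python sketch of this computation (cf. listing 10.10), under the same assumed node layout as the earlier sketches:

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def update_bounding_envelope(node):
        # Entries are points for leaves, children's centroids for internal nodes.
        entries = node.points if node.leaf else [c.centroid for c in node.children]
        k = len(entries[0])
        node.centroid = tuple(sum(e[i] for e in entries) / len(entries)
                              for i in range(k))
        if node.leaf:
            # Points have radius 0, so the plain distance suffices.
            node.radius = max(euclidean(node.centroid, p) for p in node.points)
        else:
            # Distance between centroids plus the child's own radius, so that
            # every child sphere fits entirely inside the envelope.
            node.radius = max(euclidean(node.centroid, c.centroid) + c.radius
                              for c in node.children)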

4. Insertion: Split nodes

We can now move to the implementation of the split method in listing 10.11.

This method looks relatively simple, because most of the legwork is performed by the auxiliary method findSplitIndex, described in listing 10.12.

After finding the direction with maximum variance, we sort the points or children (depending on whether the node is a leaf or an internal node) based on their coordinates along that same direction, and then, after getting the list of centroids for the node's entries, we split this list, again along the direction of maximum variance. We'll see how to do that in a moment.

Before that, note that we have again run into the method returning the centroids of the entries within the node's bounding envelope, so it's probably the right time to define it! As we mentioned before, the logic of the method is dichotomic:

  • If the node is a leaf, this means that it returns the points contained in it.
  • Otherwise it will return the centroids of the node’s children.

Listing 10.13 puts this definition into pseudo-code.
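In Python, the dichotomy is a two-liner (a sketch under the same assumed node layout):

    def centroids(node):
        # Points for a leaf, children's centroids for an internal node.
        return node.points if node.leaf else [child.centroid for child in node.children]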

After retrieving the index of the split point, we can actually split the node's entries. Here we need two different conditional branches to handle leaves and internal nodes differently: we need to pass the node constructors the right arguments, depending on the type of node we want to create. Once we have constructed the new nodes, all we need to do is return them.
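Putting the pieces together, a sketch of the whole split (cf. listings 10.11 and 10.12) might look like this; min_variance_split is covered below, and the two make_* constructors are the same hypothetical factories used earlier:

    def split_node(node, m):
        k = len(node.centroid)
        axis = direction_of_max_variance(centroids(node), k)
        # Sort entries along the direction of maximum variance.
        if node.leaf:
            node.points.sort(key=lambda p: p[axis])
        else:
            node.children.sort(key=lambda c: c.centroid[axis])
        # Find the split index minimizing the two variances along that axis.
        index = min_variance_split([c[axis] for c in centroids(node)], m)
        if node.leaf:
            halves = node.points[:index], node.points[index:]
            new_nodes = [make_leaf_node(points=list(h)) for h in halves]
        else:
            halves = node.children[:index], node.children[index:]
            new_nodes = [make_internal_node(children=list(h)) for h in halves]
        for n in new_nodes:
            update_bounding_envelope(n)
        return new_nodes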

Hang tight; we aren't done yet. I know we have been going through this section for a while now, but we're still missing one piece of the puzzle to finish our implementation of the insert method: the splitPoints helper function.

This method might seem trivial, but it’s actually a bit tricky to get it right. Let’s say it needs at least some thought.

So, let's first go through an example, and then write some pseudo-code for it! Figure 10.18 illustrates the steps we need to perform such a split. We start with a node containing eight points. We don't know, and don't need to know, whether those are dataset points or nodes' centroids; it is irrelevant for this method.

Suppose we have computed the direction of maximum variance, and that it is along the y axis; we then need to project the points onto this axis, which, because of the definition of our coordinate system, is equivalent to considering only the y coordinate of each point.

In the diagram we show the projection of the points, since it's visually more intuitive. For the same reason, we then rotate the axis and the projections 90° clockwise, remove the points' labels, and index the projected points from left to right. In our code, we would have to sort the points according to their y coordinate (as we saw in listing 10.12), and then we can just consider their indices; an alternative could be using indirect sorting and keeping a table of sorted/unsorted indices, but this would substantially complicate the remaining code.

As shown, we have eight points to split. We can deduce that the parameter M, the maximum number of points/children for a tree node, is equal to 7, and thus m, the minimum number of entries, can only be equal to 2 or 3 (technically it could also be 1, but that choice would produce skewed trees, and it's usually not even worth implementing these trees with m==1).

It's worth mentioning again that the value of m must be chosen at the time of creation of our SS-tree, and therefore it is fixed when we call split. Here we are just reasoning about how this choice influences how the splits are performed, and ultimately the structure of the tree.

And indeed, this value is crucial to the split method, because each of the two partitions created will need to have at least m points; therefore, since we are using a single-index split,[7] the possible values for this split index go from m to M+1-m. In our example, as shown in the middle section of figure 10.18, this means

  • If m==2, then we can choose any index between 2 and 6 (5 choices).
  • If m==3, then the alternatives are between 3 and 5 (3 choices).

Now suppose we had chosen m==3. The bottom section of figure 10.18 shows the resulting split for each of the three alternative choices we have for the split index. We will choose the one that minimizes variance for both nodes (usually, we minimize the sum of the two variances), but we only minimize variance along the direction of the split, so in the example we will only compute the variance of the y coordinates of the two sets of points. Unlike with R-trees, we won't try to minimize the bounding envelopes' overlap at this stage, although it turns out that reducing variance along the direction that had the highest variance brings, as an indirect consequence, a reduction of the average overlap of the new nodes.

Also, with SS+-trees, we will tackle the issue of overlapping bounding envelopes separately.

For now, to finish with the insertion method, please look at listing 10.14 for an implementation of the minVarianceSplit method. As mentioned, it's just a linear search among the M - 2*(m-1) possible options for the split index of the points.
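The corresponding Python sketch is a plain linear scan (values are the entries' coordinates along the split axis, already sorted; variance is again computed with the standard library):

    import statistics

    def min_variance_split(values, m):
        # Legal split indices leave at least m entries on each side:
        # i ranges from m to len(values)-m, i.e. M - 2*(m-1) candidates
        # when len(values) == M+1.
        best_index, best_score = None, float("inf")
        for i in range(m, len(values) - m + 1):
            score = statistics.pvariance(values[:i]) + statistics.pvariance(values[i:])
            if score < best_score:
                best_index, best_score = i, score
        return best_index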

And with this, we can finally close this section about SsTree::insert. You might feel this was a very long road to get here, and you'd be right: this is probably the most complicated code we've described so far. Take your time to read the last few sub-sections multiple times, if it helps, and then brace yourself: we are going to delve into the delete method next, which is likely even more complicated.

5. Delete

Like insert, delete in SS-trees is also heavily based on B-trees' delete. B-trees' delete is itself normally considered so complicated that many textbooks skip it altogether (for the sake of space), and implementing it is usually avoided as long as possible. The SS-tree version, of course, is even more complicated than the original.

But one of the aspects where R-trees and SS-trees outperform k-d trees is that, while k-d trees are guaranteed to be balanced only if initialized on a static dataset, both R-trees and SS-trees can remain balanced even on dynamic datasets, with large volumes of insertions and removals. Giving up on delete would therefore mean turning down one of the main reasons we need this data structure.

The first (and easiest) step is finding the point we would like to delete, or better, finding the leaf that holds that point. While for insert we would only traverse one path to the closest leaf, for delete we are back to the search algorithm described in section 10.3.1; however, as for insert, we will need to perform some backtracking, and hence rather than calling search, we will have to implement the same traversal in this new method.[8]

Once we have found the right leaf L, assuming we do find the point Z in the tree (otherwise, we wouldn’t need to perform any change), we have a few possible situa­tions—an easy one, a complicated one, and a seriously complicated one:

  • If the leaf contains more than m points, we just delete Z from L, and update its bounding envelope.
  • Otherwise, after deleting Z, L will have only m-1 points, and therefore it would violate one of the SS-tree’s invariants. We have a few options to cope with this:
    • If L is the root, we are good, and we don’t have to do anything.
    • If L has at least one sibling S with more than m points, we can move one point from S to L. Although we will be careful to choose the closest point to L (among all its siblings with at least m+1 points), this operation can potentially cause L’s bounding envelope to expand significantly (if only siblings far away from L have enough points) and unbalance the tree.
    • If no sibling of L can “lend” it a point, then we will have to merge L with one of its siblings. Again, we would then have to choose which sibling to merge it with and we might choose different strategies:
      • Choosing the closest sibling
      • Choosing the sibling with larger overlap with L
      • Choosing the sibling that minimizes the coordinates variance (over all axes)

Case 2(c) is clearly the hardest to handle. Case 2(b), however, is relatively easy because, luckily, one difference from B-trees is that the node's children don't have to be sorted, so we don't need to perform rotations when we move one point from S to L. In the middle-bottom sections of figure 10.19 you can see the result of node S3 "borrowing" one of S4's children, S9—it's just as easy as that. Of course, the hardest part is deciding which sibling to borrow from and which of its children should be moved.

For case 2(c), merging two nodes will cause their parent to have one less child; we thus have to backtrack and verify that this node still has at least m children. This is shown in the top and middle sections of figure 10.19. The good news is that we can handle internal nodes exactly as we handle leaves, so we can reuse the same logic (and mostly the same code) for both.

Cases 1 (at the bottom of figure 10.19) and 2(a) are trivial, and we can easily implement them; the fact that when we get to the root we don't have to take any extra action (as we do for insert) makes the SsTree::delete wrapper method trivial.

Enough with the examples; it’s time to write the body of the delete method, shown in listing 10.15.

As you can see, this method is as complicated as insert (possibly even more so!); thus, similarly to what we did for insert, we broke the delete method down into several helper functions to keep it leaner and cleaner.
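Before looking at the helpers, here is a heavily condensed Python sketch of the recursive flow (a simplification, not a transcription of the book's listing 10.15). It returns a pair (deleted, violates_invariant); siblings_to_borrow_from and find_sibling_to_merge_to are hypothetical helpers embodying the "which sibling?" heuristics discussed below, and intersects_point is the check from the search sketch:

    def delete(node, target, m):
        if node.leaf:
            if target not in node.points:
                return False, False              # target not here: backtrack
            node.points.remove(target)
            update_bounding_envelope(node)
            return True, len(node.points) < m    # case 2 if we dropped below m
        for child in node.children:
            if intersects_point(child, target):
                deleted, violation = delete(child, target, m)
                if not deleted:
                    continue                     # try the other overlapping branches
                if violation:
                    # Case 2(b): borrow from a sibling with more than m entries,
                    # or, failing that, case 2(c): merge with a sibling.
                    siblings = siblings_to_borrow_from(node, child, m)
                    if siblings:
                        borrow_from_sibling(child, siblings)
                    else:
                        merge_children(node, child,
                                       find_sibling_to_merge_to(node, child))
                update_bounding_envelope(node)
                return True, len(node.children) < m
        return False, False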

This time, however, we won't describe all of the helper methods in detail. All the methods involving finding something "closest to" a node, such as the function findSiblingToMergeTo in listing 10.15, are heuristics that depend on the definition of "closer" that we adopt. As mentioned when describing how delete works, we have a few choices, from shortest distance (which is also the easiest to implement) to lowest overlap.

For the sake of space, we need to leave these implementations (including the choice of the proximity function) to the reader. If you refer to the material presented in this and the previous section, you should be able to easily implement the versions using Euclidean distance as the proximity criterion.

So, to complete our description of the delete method, we can start from findClosestEntryInNodesList. Listing 10.16 shows the pseudo-code for this method, which is just another linear search within a list of nodes, with the goal of finding the closest entry contained in any of the nodes in the list. Notice that we also return the parent node, because it will be needed by the caller.

Next, listing 10.17 describes the borrowFromSibling method, which moves an entry (a point, for leaves, or a child node, for internal nodes) to the node that currently violates the minimum points/children invariant, taking it from one of its siblings. Obviously, we need to choose a sibling that has more than m entries, to avoid just moving the issue around (the sibling will have one less entry afterward, and we don't want it to violate the invariant itself!). For this implementation, we will assume that all elements of the list siblings, passed as input, are non-null nodes with at least m+1 entries, and also that siblings is not empty. If you are implementing this code in a language that supports assertions, you might want to add an assert to verify these conditions.

If this condition is met, we want to find the best entry to "steal"; usually this means the one closest to the destination node, but, as mentioned, other criteria can be used. Once we find it, we just need to move the closest entry from the source to the destination of this transaction and update both nodes accordingly.
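Under those assumptions, a Python sketch of the method could read as follows; find_closest_entry_in_nodes_list stands for the helper of listing 10.16, assumed to return both the chosen entry and the sibling that holds it:

    def borrow_from_sibling(node, siblings):
        # Pick the entry (point or child node) closest to node among all siblings.
        entry, source = find_closest_entry_in_nodes_list(siblings, node)
        if source.leaf:
            source.points.remove(entry)
            node.points.append(entry)
        else:
            source.children.remove(entry)
            node.children.append(entry)
        update_bounding_envelope(source)
        update_bounding_envelope(node)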

If, instead, this condition is not met, and there is no sibling of the child violating invariants from which we can borrow an entry, it can mean one of two things:

  • There are no siblings: assuming m≥2, this can only happen if we are at the root, and it only has one child. In this case, there is nothing to do.
  • There are siblings, but all of them have exactly m entries. In this case, since m≤M/2, if we merge the invalid node with any of its siblings, we get a new node with 2*m-1<M entries—in other words, a valid node that doesn’t violate any invariant.

Listing 10.18 shows how to handle both situations: we check whether the second argument is null to determine whether we are in the former or the latter case, and if we do need to perform a merge, we also clean up the parent node (which is the node on which the mergeChildren method is called).
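A sketch of this logic, merging in place rather than allocating a new node (an implementation choice of this sketch, not necessarily the book's):

    def merge_children(parent, node, sibling):
        if sibling is None:
            return                               # root with a single child: nothing to do
        # Fuse the two nodes: together they hold 2*m-1 <= M-1 entries.
        if node.leaf:
            node.points.extend(sibling.points)
        else:
            node.children.extend(sibling.children)
        update_bounding_envelope(node)
        parent.children.remove(sibling)          # clean up the parent's children list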

This was the last piece of pseudo-code we were missing for delete. Before wrapping up this section, though, I'd like to exhort the reader to take another look at figure 10.19. The final result is a valid SS-tree that doesn't violate any of its invariants, but let's be honest, the result is not that great, right? Now we have one huge sphere, S3, that takes up almost all the space in its parent's bounding envelope and significantly overlaps not just its sibling, but also its parent's other branch.

If you remember, this was mentioned as a risk in the description of case 2(b) for delete. It is, unfortunately, a common side effect of both node merges and moving nodes/points among siblings; especially when the choice of the entry to move is constrained, a far-away entry can be picked for the merge/move, and—as in the example in figure 10.19—in the long run, after many deletions, this can make the tree unbalanced.

We need to do better if we would like to keep our tree balanced and its performance acceptable, and in section 10.6 we will see a possible solution: SS+-trees.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
