So far, we have used the original SS-tree structure, as described in the original paper by White and Jain; SS-trees were developed to reduce node overlap, and in turn the number of leaves traversed by a search on the tree, with respect to R-trees and k-d trees.

1. Are SS-trees better?

With respect to k-d trees, the main advantage of this new data structure is that it is self-balancing, so much so that all leaves are at the same height. Also, using bounding envelopes instead of splits parallel to a single axis mitigates the curse of dimensionality, because unidimensional splits only allow partitioning the search space along one direction at a time.

R-trees also use bounding envelopes, but with a different shape: hyper-rectangles instead of hyper-spheres. While hyper-spheres can be stored more efficiently and allow for faster computation of their exact distance, hyper-rectangles can grow asymmetrically in different directions: this allows us to cover a node with a smaller volume, while hyper-spheres, being symmetrical in all directions, generally waste more space, with large regions without any point. And indeed, if you compare figure 10.4 to figure 10.8, you can see that rectangular bounding envelopes fit the data more tightly than the spherical ones of the SS-tree.

On the other hand, it can be proved that decomposition into spherical regions minimizes the number of leaves traversed.

If we compare the growth of the volume of spheres and cubes in a k-dimensional space, for different values of k, given by these formulas (where the cube is the one circumscribing a sphere of radius r, so its side is 2r)

    V_sphere(k, r) = π^(k/2) / Γ(k/2 + 1) · r^k
    V_cube(k, r)   = (2r)^k

we can see that spheres grow more slowly than cubes, as also shown in figure 10.25.
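To get a feel for how fast the gap widens, here is a small sketch (in Python, with hypothetical function names) that evaluates the two formulas above and prints their ratio for a few dimensions:

```python
import math

def sphere_volume(k, r=1.0):
    """Volume of a k-dimensional hyper-sphere of radius r:
    pi^(k/2) / Gamma(k/2 + 1) * r^k."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def cube_volume(k, r=1.0):
    """Volume of the k-dimensional hyper-cube circumscribing that
    sphere, whose side is 2r."""
    return (2 * r) ** k

# The sphere/cube ratio shrinks quickly as the dimension grows:
# in 2-D it's pi/4 (about 0.785), in 10-D it's already below 0.3%.
for k in (2, 3, 5, 10):
    print(k, sphere_volume(k) / cube_volume(k))
```

Note how quickly the ratio collapses: in high dimensions, a bounding sphere occupies a vanishing fraction of the corresponding bounding cube.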

And if a group of points is uniformly distributed along all directions and shaped as a spherical cluster, then a hyper-sphere is the type of bounding envelope that wastes the lowest volume, as you can also see for a 2-D space in figure 10.23, where a circle of radius r is inscribed in a square of side 2r. If the points are distributed in a circular cluster, then all the area between the circle and its circumscribed square (highlighted in figure 10.23) is potentially empty, and so wasted.

Experiments have confirmed that SS-trees using spherical bounding envelopes per­form better on datasets uniformly distributed along all directions, while rectangular envelopes work better with skewed datasets.

Neither R-trees nor SS-trees can offer logarithmic worst-case upper bounds for their methods. In the (unlikely, but possible) worst case, all leaves of the tree will be traversed, and there are at most n/m of them. This means that each of the main operations on these data structures can take up to linear time in the size of the dataset. Table 10.1 summarizes their running time, comparing them to k-d trees.

2. Mitigating hyper-sphere limitations

A question now arises: Is there anything we can do to limit the cons of using spherical bounding envelopes, so that we can reap the advantages when we have symmetrical datasets, and limit the disadvantages with skewed ones?

To cope with skewed datasets, we could use ellipsoids instead of spheres, so that the clusters can grow in each direction independently. However, this would complicate search, because we would want to compute the radius along the direction connecting the centroid to the query point, which in the general case won’t lie on any of the axes.

A different approach to reduce the wasted area attempts to reduce the volume of the bounding spheres used. So far we have always used spheres whose center was a group of points’ center of mass, and whose radius was the distance to the furthest point, so that the sphere would cover all points in the cluster. This, however, is not the smallest possible sphere that covers all the points. Figure 10.26 shows an example of such a difference.

Computing this smallest enclosing sphere in higher dimensions is, however, not feasible, because the algorithm to compute the exact values for its center (and radius) is exponential in the number of dimensions.

What can be done, however, is to compute an approximation of the smallest enclosing sphere, starting with the cluster's center of mass as an initial guess. At a high level, the approximation algorithm, at each iteration, moves the center toward the farthest point in the dataset. After each iteration, the maximum distance the center can move is shortened, limited by the span of the previous update, thus ensuring convergence.
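A minimal sketch of this idea (in the style of the Badoiu-Clarkson approximation; the function name and the choice of the center of mass as a starting point are our own) could look like this:

```python
import math

def approx_min_enclosing_sphere(points, iterations=100):
    """Approximate the smallest enclosing sphere: start from the center
    of mass and, at iteration i, move the center toward the current
    farthest point by a fraction 1/(i+1) of their distance. The
    shrinking step bounds how far the center can move at each iteration
    and ensures convergence."""
    d = len(points[0])
    # Initial guess: the cluster's center of mass.
    center = [sum(p[j] for p in points) / len(points) for j in range(d)]
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        step = 1.0 / (i + 1)
        center = [c + step * (f - c) for c, f in zip(center, farthest)]
    # The radius must still cover every point in the cluster.
    radius = max(math.dist(p, center) for p in points)
    return center, radius
```

For a skewed cluster (say, many points near one end of a segment and a single outlier at the other end), the returned radius is noticeably smaller than the one obtained by centering the sphere on the center of mass.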

We won’t delve deeper into this algorithm in this context; the interested readers can read more about this method, starting, for instance, with this article[3] by Fischer et al. For now, we will move to another way we could improve the balancing of the tree: reducing the node overlap.

3. Improved split heuristic

We saw in section 10.3 that splitting and merging nodes, as well as “borrowing” points/children from siblings, can result in skewed clusters that require bounding envelopes larger than necessary, and increase node overlap.

To counteract this effect, Kurniawati et al., in their work on SS+-trees, introduce a new split heuristic that, instead of only partitioning points along the direction of maximum variance, tries to find the two groups such that each of them will collect the closest nearby points.

To achieve this result, a variant of the k-means clustering algorithm will be used, with two constraints:

  • The number of clusters will be fixed and equal to 2.
  • The maximum number of points per cluster is bound to M, the maximum number of entries allowed per node.
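A sketch of this constrained k-means split could look like the following (Python; the function name and the seeding strategy, picking the farthest pair of points as initial centroids, are our own choices, and the capacity constraint of at most M points per cluster is omitted for brevity):

```python
import math

def two_means_split(points, max_iterations=10):
    """Split a node's entries into two groups using k-means with k
    fixed to 2, so each group collects the closest nearby points."""
    # Seed the two centroids with the farthest pair of points, so the
    # initial clusters start as spread apart as possible.
    c0, c1 = max(((p, q) for p in points for q in points),
                 key=lambda pair: math.dist(*pair))
    for _ in range(max_iterations):
        clusters = ([], [])
        # Assignment step: each point goes to the closest centroid.
        for p in points:
            idx = 0 if math.dist(p, c0) <= math.dist(p, c1) else 1
            clusters[idx].append(p)
        # If a centroid gets no points, k-means drops it; here we stop.
        if not clusters[0] or not clusters[1]:
            break
        # Update step: each centroid becomes its cluster's center of mass.
        new_c0 = tuple(sum(x) / len(clusters[0]) for x in zip(*clusters[0]))
        new_c1 = tuple(sum(x) / len(clusters[1]) for x in zip(*clusters[1]))
        if (new_c0, new_c1) == (c0, c1):
            break  # converged
        c0, c1 = new_c0, new_c1
    return clusters
```

On two well-separated groups of points, the two returned clusters match the groups, which is exactly the property the split heuristic is after.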

We will talk in more depth about clustering and k-means in chapter 12, so please refer to section 12.2 to see the details of its implementation.

The running time for k-means, with at most j iterations on a dataset of n points in d dimensions, is O(jkdn), where, of course, k is the number of centroids.

Since for the split heuristic we have k==2, and the number of points in the node to split is M+1, the running time becomes O(jdM), compared to the O(dM) of the original split heuristic described in section 10.3. We can therefore trade off the quality of the result for performance by controlling the maximum number of iterations j.

Figure 10.27 shows an example of how impactful this heuristic can be, and why the increase in running time is well worth it.

Although newer, more complex clustering algorithms have been developed over the years (as we’ll see in chapter 12, where we’ll also describe DBSCAN and OPTICS), k-means is still a perfect match for SS-trees, because it naturally produces spherical clusters, each with a centroid equal to the center of mass of the points in the cluster.

4. Reducing overlap

The k-means split heuristic is a powerful tool to reduce node overlapping and keep the tree balanced and search fast, as we were reminded at the beginning of the last section; however, deleting points can also unbalance a node, in particular during merges or when we move a point/child across siblings. Moreover, sometimes the overlapping can be the result of several operations on the tree and involve more than just two nodes, or even more than a single level.

Finally, the k-means split heuristic doesn’t have overlap minimization as a criterion, and because of the intrinsic behavior of k-means, the heuristic could produce results where a node with larger variance might completely overlap a smaller node.

To illustrate these situations, the top half of figure 10.28 shows several nodes and their parents, with a significant overlap.

To solve such a situation, SS+-trees introduce two new elements:

  1. A check to discover such situations.
  2. A new heuristic that applies k-means to all grandchildren of a node N (no matter whether they are points or other nodes; in the latter case, their centroids will be used for clustering). The clusters created will replace N’s children.

To check the overlap, the first thing we need is a formula to compute the volume of the intersection of two spheres. Unfortunately, computing the exact overlap of two hyper-spheres in the generic k-dimensional case requires not just substantial work and good calculus skills to derive the formulas, but also robust computing resources, as the result includes an integral whose computation is clearly expensive enough to make you question its usefulness in a heuristic.

An alternative approach is to check whether one of the two bounding envelopes is completely included in the other: this is the case when the distance between the two centroids is smaller than R-r, where R is the radius of the larger sphere and, as you might expect, r is the radius of the smaller one.

Variants of this check can set a threshold for the ratio between the actual distance of the two centroids and R-r, using it as an approximation of the overlapping volume as this ratio gets closer to 1.
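Both the exact containment check and its ratio-based variant boil down to a couple of one-liners; here is a sketch (function names are our own, and we assume R > r):

```python
import math

def containment_ratio(center_small, r, center_big, R):
    """Ratio between the centroids' distance and R - r. A value <= 1
    means the smaller sphere lies entirely inside the larger one;
    values just above 1 signal heavy (but partial) overlap."""
    return math.dist(center_small, center_big) / (R - r)

def is_contained(center_small, r, center_big, R):
    """True when the smaller bounding envelope is completely included
    in the larger one."""
    return R > r and math.dist(center_small, center_big) <= R - r
```

A threshold-based variant would then flag two nodes for reorganization when, say, `containment_ratio(...) <= 1.2`, trading some precision for catching near-containment cases too.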

The reorganization heuristic is then applied if the check’s condition is satisfied. A good understanding of k-means is needed to get into the details of this heuristic, so we’ll skip it in this context, and we encourage readers to refer to chapter 12 for a description of this clustering algorithm. Here, we will use an example to illustrate how the heuristic works.

In the example, the heuristic is called on node S1 and the clustering is run on its grandchildren, points A-L. As mentioned, these could also be centroids of other internal nodes, and the algorithm wouldn’t change.

The result is shown in the bottom half of figure 10.28. You might wonder why there are now only four children in node S1: even though k-means was called with k, the number of initial clusters, equal to five (the number of S1’s children), this clustering algorithm can output fewer than k clusters. If at any time during its second step, point assignment, one of the centroids gets no points assigned, that centroid is simply removed and the algorithm continues with one less cluster.

Both the check and the reorganization heuristic are resource-consuming; the latter in particular requires O(jMk) comparisons/assignments, where j is the maximum number of iterations we use in k-means. Therefore, the original paper recommended checking the overlap situation after splitting nodes, but applying the reorganization infrequently.

We can also easily run the check when an entry is moved across siblings, while it becomes less intuitive when we merge two nodes. In that case, we could always check all pairs of the merged node’s siblings, or—to limit the cost—just sample some pairs.

To limit the number of times we run the reorganization heuristic and avoid running it again on nodes that were recently re-clustered, we can introduce a threshold for the minimum number of points added/removed on the subtree rooted at each node, and only reorganize a node’s children when this threshold is passed. These methods prove to be effective in reducing the variance of the tree, producing more compact nodes.
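The bookkeeping behind this threshold can be as simple as a per-node counter; here is a hypothetical sketch (class and method names are our own, and the threshold value is arbitrary):

```python
class ReorgCounter:
    """Throttle the reorganization heuristic: count entries
    added/removed in the subtree rooted at a node, and only allow
    re-clustering that node's children once the counter passes a
    threshold, resetting it afterward."""
    def __init__(self, threshold=10):
        self.threshold = threshold
        self.modifications = 0

    def record_change(self, count=1):
        """Called whenever points are inserted into or deleted from
        the node's subtree."""
        self.modifications += count

    def should_reorganize(self):
        """True at most once per `threshold` modifications, so a node
        that was just re-clustered won't be reorganized again soon."""
        if self.modifications >= self.threshold:
            self.modifications = 0  # reset after triggering
            return True
        return False
```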

But I’d like to conclude the discussion about these variants with a piece of advice: start implementing basic SS-trees (at this point, you should be ready to implement your own version), then profile them within your application (like we did for heaps in chapter 2), and only if SS-trees appear to be a bottleneck and improving their running time would reduce your application running time by at least 5-10%, try to implement one or more of the SS+-tree heuristics presented in this section.

Source: Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications (2021)
