Clustering: OPTICS

As we saw in the last section, DBSCAN is a powerful algorithm, able to identify non-linearly separable clusters of any shape; however, it has a weak spot: the parameters regulating the density thresholds. In particular, it is hard to find the best value for ϵ (epsilon), the radius of the core region that determines which points are reachable from each other. When a dataset has areas with different densities, it becomes practically impossible to find a value that works equally well for the whole dataset (as shown in figure 12.14).

It doesn't matter whether we search for this value manually or semi-automatically (see section 12.5): DBSCAN alone is simply not equipped to handle non-uniform datasets.

It didn't take long for computer scientists to come up with a new idea that could help in these situations: the missing key step was "Ordering Points To Identify the Clustering Structure," a phrase its authors turned into the acronym OPTICS to name the algorithm they invented.

When we discussed DBSCAN, we talked about the order in which points are processed. For DBSCAN, the only thing that matters is that points in the same cluster (that is, core points reachable from each other) are processed together. However, as we have seen, this is mostly a matter of optimization, so that we don't have to keep a disjoint set and merge clusters when we find a pair of directly reachable points; the order of processing only influences the assignment of non-core points on the edge of clusters.

The idea behind OPTICS is that, instead, this order does matter, and in particular it can make sense to keep expanding a “frontier” for the current cluster by adding the unprocessed point that is closest to the cluster (if it is reachable from the cluster).

Figure 12.15 illustrates this concept in a simplified scenario. In order to choose the right point, it’s obviously necessary to keep track of the distances of all undiscovered neighbors of a cluster,[2] and for this purpose, the authors of the paper introduced two definitions: to each point, they associated a core distance and a reachability distance.

Figure 12.15 Considering the distance of undiscovered points (P1 to P4) from (points within) a certain cluster. The key idea in OPTICS is “discovering” these points from the closest to the furthest; in the example, P3 would be the next one to be processed.

1. Definitions

The core distance of a point p is the minimum distance for that point to be considered a core point. Since the definition of a core point, given in section 12.3 for DBSCAN, depends on the parameters ϵ, the radius of dense regions, and minPoints (M for short in the following formulas), the minimum number of points needed to define a dense region, these definitions also depend on those hyperparameters:
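In symbols (a reconstruction from the description below, where d(p, p_M) denotes the distance from p to its M-th nearest neighbor p_M):

$$
\text{core-distance}_{\epsilon,M}(p) =
\begin{cases}
\text{undefined} & \text{if } p \text{ is not a core point}\\[2pt]
d(p, p_M) & \text{otherwise}
\end{cases}
$$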

Naturally, if a point is not a core point (that is, fewer than M points lie in its ϵ-neighborhood), its core distance is undefined. Conversely, if it is a core point, its core distance is the distance to its M-th nearest neighbor:[3] in this case, the core distance can be interpreted as the minimum value we could assign to the parameter ϵ for p to be a core point (given a fixed value for minPoints).

Once we have defined the core distance, we can use it to define a new kind of metric, the reachability distance between two points:
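In symbols (again a reconstruction from the description that follows; note that the distance is only defined when q is a core point):

$$
\text{reachability-distance}_{\epsilon,M}(p, q) =
\begin{cases}
\text{undefined} & \text{if } q \text{ is not a core point}\\[2pt]
\max\!\big(\text{core-distance}_{\epsilon,M}(q),\ d(p, q)\big) & \text{otherwise}
\end{cases}
$$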

Reachability distance is not symmetric. Given a point q, the reachability distance of p from q can be interpreted as the minimum value of ϵ for which p is density-reachable from q; therefore, it is either q's core distance (the distance from q to its M-th nearest neighbor) or the actual distance between p and q, whichever is larger. Moreover, if p and q are nearest neighbors (in particular, if there is no point closer to p than q), then this value is the smallest value we can assign to ϵ in order for p and q to belong to the same cluster.
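To make this concrete with arbitrarily chosen numbers: suppose M=3, q's third-nearest neighbor lies at distance 0.6, and d(p, q) = 0.4; then the reachability distance of p from q is max(0.6, 0.4) = 0.6. If instead d(p, q) = 0.9, the reachability distance is 0.9, the actual distance between the two points.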

2. OPTICS algorithm

Given these two definitions, we can easily describe the algorithm. Its core idea is similar to DBSCAN's: points are added to the current cluster only if they are reachable from points already in the cluster. In other respects, however, OPTICS is quite different. For starters, it doesn't produce just a flat partitioning of the points; instead, it builds a hierarchical clustering. We'll get back to this in a minute.

OPTICS takes the same parameters as DBSCAN, but ϵ, in this case, has a different meaning: it's the largest radius that will be considered to determine the core and reachability distances. For this reason (and for disambiguation, as we'll see), in the rest of the chapter we will refer to this parameter of OPTICS as ϵMAX. By using this approach, the algorithm simultaneously constructs density-based clusterings for an infinite number of values of ϵ (all values between 0 and ϵMAX), and hence for an infinite number of densities.

In the formulas in the previous sub-section, we saw that these distances are set to undefined if the ϵ-neighborhood of a point doesn't contain enough neighbors. This also means that the reachability distance from a point q (to any other point) will be undefined if q is not a core point, that is, if it doesn't have at least M-1 neighbors within a radius ϵMAX.

Therefore, if we set ϵMAX to a large value, we will have fewer points marked as noise and, in a way, we will leave more options open, because we will allow larger values for the core distance and more points will have a defined reachability distance.

However, a larger radius for the core density areas means that these areas will contain more points, and this makes the algorithm slower.

In the main section of the algorithm, in fact, for each point processed we need to update the reachability distance of all undiscovered points in its ϵ-neighborhood. The more points this neighborhood contains, the slower the algorithm will be.

As we briefly mentioned, OPTICS' main cycle grows the current cluster C by adding the closest point to the cluster's frontier, which is the point with the smallest reachability distance from any point already in C. The cluster is formed in a way similar to DBSCAN's because all points reachable from the cluster's seed (the first point processed for a new cluster) are added to it. However, forming these clusters is not the direct objective of OPTICS.

When a new point is “discovered” and processed, its reachability distance from the cluster is set in stone. This cluster-to-point distance is not to be confused with the reachability distance between two points. For a point q and a cluster C, we define the reachability distance of q from C as the minimum of the reachability distance from p to q, for all points p in C:
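In symbols (a reconstruction from the sentence above, where reachability-distance(q, p) is the reachability distance of q from p):

$$
\text{reachability-distance}(q, C) = \min_{p \in C}\ \text{reachability-distance}_{\epsilon,M}(q, p)
$$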

If, when we process a point p, we update the reachability distances of all points in its ϵ-neighborhood, and we keep a priority queue with all the undiscovered points in the neighborhoods of any point in C, then we can be sure that:

  1. The reachability distance of every point q at the frontier of the cluster (every point that is a neighbor of at least one point in the cluster) is correctly stored.
  2. The value stored for q is the smallest of the reachability distances from any of the points already processed to q (although we only care about the current cluster’s points).
  3. The top of the queue holds the closest point to cluster C:
    • Technically, at the top of the queue we have the point q with an associated value ϵq (the smallest value of ϵ for which q would still be reachable from C) such that, for any other point w on C’s frontier, ϵq ≤ ϵw.
  4. Therefore, any other point still to be processed has a reachability distance from C at least as large as q’s.

Listings 12.7 and 12.8 describe the main OPTICS algorithm, while on the book’s repo on GitHub,[5] you can find a Python implementation of the algorithm and a Jupyter Notebook to experiment with it.

The algorithm’s core (also illustrated in figure 12.16 on a simplified example) consists of a cycle through each point in the dataset. Entries are processed in chunks of contiguous, reachable points (the starting point is chosen either randomly or according to its position in the dataset), and non-core points are “skipped” (similar to what happened in DBSCAN).

The points that are reachable from the ones already processed are kept in a priority queue and extracted according to their reachability distance (the smallest reachability distance from any of the processed points is on top of the queue, as we have already seen).

There is a key piece of code still missing: the updateQueue method. That’s where the reachability distances are actually updated (together with the priority queue). Listing 12.8 bridges this gap. The main purpose of this method is to go through all points q in the ϵ-neighborhood of a point p (the point currently being processed) and update q’s reachability distance by checking whether p is closer to q than any of the other points previously processed.
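To make this flow concrete, here is a compact, self-contained Python sketch of the same ideas. It is not the book’s listing 12.7 or 12.8: the names (optics, update_queue, region_query), the brute-force neighborhood search, and the use of infinity to represent “undefined” distances are simplifying choices made here purely for illustration.

```python
import heapq
import math

UNDEFINED = math.inf  # stand-in for "undefined" core/reachability distances

def region_query(points, i, eps_max):
    """Indices of the points within eps_max of points[i] (brute force, for brevity)."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= eps_max]

def core_distance(points, i, neighbors, min_points):
    """Distance to the (min_points-1)-th neighbor (the point counts itself), or UNDEFINED."""
    if len(neighbors) < min_points - 1:
        return UNDEFINED
    dists = sorted(math.dist(points[i], points[j]) for j in neighbors)
    return dists[min_points - 2]

def update_queue(points, i, neighbors, core_dist, reachability, processed, heap):
    """Relax the reachability distance of i's undiscovered neighbors and push them on the heap."""
    for j in neighbors:
        if processed[j]:
            continue
        new_reach = max(core_dist, math.dist(points[i], points[j]))
        if new_reach < reachability[j]:
            reachability[j] = new_reach
            heapq.heappush(heap, (new_reach, j))  # lazy update: stale entries are skipped later

def optics(points, eps_max, min_points):
    """Return the processing order and the final reachability distance of each point."""
    n = len(points)
    reachability = [UNDEFINED] * n
    processed = [False] * n
    order = []

    def process(i, heap):
        processed[i] = True
        order.append(i)
        neighbors = region_query(points, i, eps_max)
        core_dist = core_distance(points, i, neighbors, min_points)
        if core_dist != UNDEFINED:           # only core points can "reach" other points
            update_queue(points, i, neighbors, core_dist, reachability, processed, heap)

    for start in range(n):                   # seeds chosen by position in the dataset
        if processed[start]:
            continue
        heap = []
        process(start, heap)
        while heap:                          # expand the current chunk of reachable points
            _, i = heapq.heappop(heap)
            if processed[i]:                 # stale heap entry
                continue
            process(i, heap)
    return order, reachability
```

Calling optics(points, eps_max, min_points) on a list of coordinate tuples returns exactly the two pieces of information a reachability plot needs: the processing order and the final reachability distance of each point.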

As you can see from figure 12.16, we use the priority queue to keep track of the (intermediate) reachability distances of points, and set these distances in stone only when a point is processed. Moreover, we can see that the reachability distance of the first point to be processed for each of the clusters will certainly be either undefined or larger than ϵMAX.

I realize that the discussion so far has been quite abstract. To understand the real purpose of computing these values for the reachability distance and the ordering of processing for the dataset, and how they can be used to build a hierarchical clustering, we need to get ahead of ourselves and take a look at figure 12.17, which shows a reachability plot that can be considered the output of OPTICS (and also requires a further step after the main algorithm, as we’ll see in the next section).

3. From reachability distance to clustering

The first thing you should notice in figure 12.17 is that the plot is determined by three parameters: minPoints and ϵMAX are the arguments passed to OPTICS, but there is also another value, ϵ, that is used as a threshold for the reachability distance. Obviously, as you can imagine, ϵ ≤ ϵMAX; this threshold is used in a step that is performed separately on OPTICS’ results (we’ll describe it in a few lines).

Setting a threshold ϵ for the reachability distance means deciding that the radius of the core region around points is equal to ϵ. This is equivalent to running DBSCAN with that particular value as the radius.

The reachability plot shown in figure 12.17 is a special plot composed of two interconnected charts. The top one simply shows the final (flat) clustering that we obtain by setting the threshold to ϵ, while the bottom one shows the ordered sequence of reachability distances and explains how the clustering was derived. Clusters and reachability distances are filled with matching colors, so it’s easier to see which points in the top half match the sections in the bottom half (outliers are marked in black, and for the sake of convenience and to properly plot all values, we assign ϵMAX instead of undefined to the reachability distance of outliers). By doing so we are, in practice, loosening the requirements of the reachability criterion for these points, but since ϵMAX is already the largest possible value that we can assign to the threshold ϵ, it won’t change anything in the following steps.

So, how do we form these clusters, given ϵ ≤ ϵMAX? Figure 12.18 illustrates the idea behind this algorithm (at a high level, combining it with elements of OPTICS). We start looking at reachability distances, following the ordering in which points were processed by OPTICS, so that we know that the reachability distance of the next point is the smallest possible among all undiscovered points (keep in mind that the reachability distances stored are point-to-cluster distances, which is also why the ordering is important!).

We start from the first point processed, call it P1, and create a new cluster, C1, for it. Being the first point of a new cluster, P1’s reachability distance is undefined, so we still don’t know whether P1 is an outlier or part of a cluster. We can only tell after the next step, shown in part (A) of figure 12.18, by checking the reachability distance (from C1) of the next point, P2.

Since this value is smaller than ϵ (in the example, ϵ=0.8), we add P2 to C1 and move to the next point, P3, which is also added to C1, as shown in (B) and (C). From a visual point of view, in the reachability chart you can see that the reachability distance for P2 is below the threshold line (parallel to the horizontal axis) drawn at ϵ=0.8.

When we get to P4, we realize that its reachability distance from C1 (or, technically, from any of the points processed before P4) is larger than ϵ, and hence P4 is not reachable from C1. Since the points are processed in order of reachability distance from the current cluster, this means that there is no point still to be processed that is reachable from points in C1 if the radius of core regions is equal to ϵ; hence, we “close” the current cluster and start a new one.

We start a new cluster for P4, but we don’t know how to classify it yet. When we check P5’s reachability distance (D), we discover that it is also larger than ϵ, and this means that the algorithm has detected that P4 is an outlier (because it’s not reachable from any point in the dataset, given the parameters minPoints and ϵ). Notice that the radius that matters here is ϵ. By using a higher value for ϵMAX, we have simply instructed OPTICS to filter out as noise only those points with a reachability distance larger than ϵMAX, while for points with a reachability distance at most ϵMAX, we defer the decision to this second step. In practice, this allows us to compute reachability distances only once and then try several values of ϵ (up to ϵMAX) in this second step with minimal computational effort.

We repeat the process one more time (E) for P6, and since its reachability distance (from P5) is smaller than ϵ, we know that it’s reachable from P5, and we can add both to the same cluster (F). If there were more points to process, we would continue in the same way.

One final remark before moving to the implementation, shown in listing 12.9: notice that if we run OPTICS twice starting from the same point, the dataset will be processed in the same order, and the results will be identical (up to possible ties in reachability distance). Therefore, OPTICS can be considered completely deterministic.

One note on the method opticsCluster: in the original paper, the authors presented a slightly different method, using points’ core distance to decide if a new cluster should be started. The version presented here stems from the consideration that if the reachability of a point q is undefined or above the threshold ϵ, while point p, the successor to q in the processing order, has a reachability distance that’s below ϵ, then q must be a core point. p’s reachability distance from the set of points processed before q, in fact, must also be undefined or greater than q’s; otherwise, p would have been processed before q. Consequently, p’s reachability distance in the plot is the reachability distance between p and q, and according to section 12.4.1, that’s only defined if q is a core point.
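To tie this back to the extraction step just described, here is a minimal sketch of how such a method might look (hypothetical names again; this is not the book’s listing 12.9). It scans the points in processing order, closes the current cluster whenever a reachability distance exceeds the threshold, and labels the point that triggered the cut as noise unless the next point turns out to be reachable below the threshold, in which case that point becomes the seed of a new cluster, exactly as in the walkthrough of figure 12.18.

```python
NOISE = -1

def optics_cluster(order, reachability, eps):
    """Turn OPTICS' output (processing order + reachability distances) into a flat
    clustering for a threshold eps <= eps_max. Undefined reachability distances are
    assumed to be represented as float('inf'), so they always exceed eps."""
    labels = [NOISE] * len(order)
    current = None    # id of the cluster being grown, or None if no cluster is open
    pending = None    # point that closed the previous cluster, not yet classified
    next_id = 0
    for idx in order:
        if reachability[idx] > eps:
            # idx is not reachable (within eps) from any previously processed point:
            # close the current cluster; idx stays noise unless its successor is reachable
            current = None
            pending = idx
        else:
            if current is None:
                # the pending point turned out to be a core point: it seeds the new cluster
                current = next_id
                next_id += 1
                if pending is not None:
                    labels[pending] = current
            labels[idx] = current
            pending = None
    return labels
```

Combined with the earlier sketch, labels = optics_cluster(*optics(points, 1.0, 5), eps=0.5) produces a flat labeling in the spirit of figure 12.17 (all parameter values here are arbitrary), and because the reachability distances are computed only once, trying other values of eps is essentially free.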

4. Hierarchical clustering

Now that we have learned how to produce a flat clustering from OPTICS’ results, we are in good shape to use the algorithm proficiently. We mentioned, though, that OPTICS is a hierarchical clustering algorithm. What does that mean, and how does it work with OPTICS?

Hierarchical clustering algorithms produce a multi-layer partitioning of a dataset. This result is often represented with a dendrogram, a tree-like structure that encodes the stratified partitioning. I like to think of exploring dendrograms as analogous to coring: we can take a section of the dendrogram and see what the flat clustering associated with that section looks like.

But enough with the abstract analogies; let’s delve into an example to clarify how this works!

Figure 12.19 shows a reachability plot obtained from the same reachability distances and ordering as the one in figure 12.17, but with a different value for ϵ. With a larger value for this radius, the reachability areas around points are obviously larger, and fewer clusters are formed.

Does this clustering make more sense than the one shown in figure 12.17? It’s hard to say just by looking at it; answering the question would probably require some domain knowledge or the tools that we will present in the next section. But since there is now just a single noise point surrounded by three clusters, we might think it makes sense that the points on the left half of the dataset all form a single cluster. To obtain that, we can try a larger value for ϵ, as shown in figure 12.20.

With ϵ=0.7 the goal is reached, but there is a catch: the two clusters on the right are also merged into a single one! This happens because the two peaks in the reachability distance plot highlighted in figure 12.20, which mark the edges of clusters C1 and C3 in the same figure, have values smaller than 0.7, so these two small clusters are reachable from the larger ones next to them.

Is there a way to keep C3 separated from C4, but C1 merged with C2? For DBSCAN, we have seen in figure 12.14 that this is not possible. For OPTICS, a flat clustering separating one but not the other would only be possible if there were a value of ϵ smaller than C3’s threshold but larger than C1’s. As figure 12.21 illustrates, this is not the case.

We seem to be back to the old issue with DBSCAN: it is not possible to find a single value of ϵ that works for the whole dataset, since one “half” of it (where x<0) has a noticeably lower density than the rest. (You can also see this from the reachability distances: the area on the right, originally shown in green, starting at index 50 on the lower graph, has a dramatically lower average reachability distance.) This is where the hierarchical clustering’s added value comes into play: when we run the main OPTICS algorithm, we produce an ordering and a reachability plot. These don’t provide a clustering immediately, but they are (or rather imply) a set of flat clusterings that we can obtain by “cutting” the reachability plot with a specific value of ϵ taken from [0, ϵMAX].

If we try all possible values of ϵ and keep track of the resulting clusterings, we can analyze how the partitioning evolves. The best way to perform this analysis is through a dendrogram, the tree-like structure shown in figure 12.22, where, for the sake of clarity, the x axis only shows as its lowest-level entries the eight clusters (plus some noise) of figure 12.17, while it would normally have one entry per point in the dataset. Notice how the clusters and noise points are ordered along the x axis of the dendrogram: all points in this plot must follow the same order as in the reachability plot (which, in turn, is the same order in which points are processed by OPTICS).

Now, looking at a dendrogram, you can see why this is called hierarchical clustering: it keeps track of a hierarchy of clusters going (top to bottom) from a single cluster containing the whole dataset to N singletons, proto-clusters with a single dataset point each. Moving from the top of the dendrogram to its bottom means virtually exploring all the possible values of ϵ, and the flat clusterings associated with those values: when, in figures 12.17, 12.19, and so on, we chose ϵ=0.5 or ϵ=0.6, we were figuratively cutting a section of the dendrogram and taking a peek at the clusters formed at that level. As we have seen, though, cutting such a section with a line perpendicular to the ϵ axis (meaning a line where ϵ is constant throughout the whole dataset) doesn’t work for our non-uniform dataset.

The great thing about having a hierarchy of clusters, though, is that we don’t have to cut this dendrogram at the same height for the whole dataset! In other words, we can use different values of ϵ in different branches of the dendrogram. How do we decide which branches and what values? Well, with 2-D datasets we can be guided by our intuition, but in higher dimensions the choice can stem from domain knowledge or be derived from some metric. For instance, in our example, we can consider the partitioning after the first split in the dendrogram, CE and CF, and compare their average densities. Since density is clearly very different between the two branches, we can come up with a different value of ϵ for each of them, based on the statistics of each subset: higher where density is lower (or, equivalently, where the average reachability distance is higher).
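As a rough illustration of this idea (a sketch only; the split position, the multiplier, and the function name are arbitrary choices for illustration, not something prescribed by the book), one could derive a separate threshold for each branch from that branch’s own reachability statistics:

```python
import math

def per_branch_thresholds(order, reachability, split_at, factor=1.5):
    """One threshold per branch: a multiple of the branch's mean finite reachability distance.
    `split_at` is the position (in processing order) of the dendrogram's first split;
    `factor` is an arbitrary multiplier used purely for illustration."""
    def mean_reach(indices):
        finite = [reachability[i] for i in indices if math.isfinite(reachability[i])]
        return sum(finite) / len(finite) if finite else math.inf
    left, right = order[:split_at], order[split_at:]
    return factor * mean_reach(left), factor * mean_reach(right)
```

Each branch can then be cut with its own threshold (for example, by running the flat-extraction step separately on each half of the ordering), which corresponds to the step-function cut shown in figure 12.23.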

If the result is not yet satisfactory, we can keep traversing each branch of the dendrogram, repeating this step, until one of the following happens:

  • We reach a point where both branches have similar characteristics.
  • We get the desired number of clusters (if we have an idea from domain knowledge).
  • We are satisfied by the result of some metric we have defined (see the next section).
  • We reach a previously chosen threshold for the maximum depth to which we traverse the tree.

Figure 12.23 shows how this would work for our example, assuming we are satisfied with traversing the dendrogram only up to the first split. You can see that now, instead of a segment, we have a step function cutting through the reachability plot and the dendrogram. In figure 12.23, while we only retain three clusters, CF, C6, and C7, we have left the border of all CF’s sub-clusters visible in the chart.

5. Performance analysis and final considerations

Hierarchical clustering is powerful, but also resource-consuming compared to flat clustering. The original paper estimates that the core OPTICS algorithm runs approximately 1.6 times slower than DBSCAN (on the same datasets), and keeping the hierarchical structure, building the dendrogram, and exploring it obviously require extra memory and computation on top of that.

For the core algorithm, a quick glance at listings 12.7 and 12.8 shows us that the code processes each point exactly once (because each point is added to and removed from the priority queue once) and runs one region query for each point processed; however, entries in the priority queue can be updated multiple times, potentially each time a point is processed. The size of the priority queue depends on the size of the dense regions: the larger ϵMAX, the more points will be in the queue. Potentially, the queue could contain all the points starting from the first iteration (if ϵMAX is larger than the maximum pairwise distance), and all points in the queue could be updated each time a new point is processed. Likewise, the time needed for nearest neighbor searches, even when using structures with worst-case guarantees such as k-d trees, depends on ϵMAX: if the radius of the region searched is large enough, these queries become linear scans of the dataset.

For these reasons, the worst-case running time of the algorithm is quadratic! Nevertheless, it can be shown that with an appropriate choice of ϵMAX, the nearest neighbor search can be performed in amortized logarithmic time, and similarly the size of the priority queue can be bounded. Therefore, with a wise choice of ϵMAX, the average running time can be as low as O(n log n) for a dataset with n points.

The original paper contains a more formal description of the algorithm and the theory behind it, and an automated procedure to build a hierarchical clustering structure from the reachability plot. It’s a good starting point if you’d like to deepen your understanding of this algorithm.

For more interesting reading, see the paper on DeLiClu, an advanced algorithm that extends OPTICS with ideas from single-linkage clustering, allowing us to avoid the parameter ϵMAX altogether, and at the same time optimizing the algorithm and improving its running time.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
