Clustering: DBSCAN

The paper introducing DBSCAN was published in 1996, presenting a novel approach to address the problem.[1] DBSCAN is an acronym for “Density-based spatial clustering of applications with noise,” and the main difference in the approach with respect to k-means is already clear from its name. While k-means is a centroid-based algorithm, and as such builds clusters as convex sets around points elected as centroids, a density-based algorithm defines clusters as sets of points that are close to each other, close enough that the density of points in any area of a cluster is above a certain threshold. The natural extension of this definition, by the way, introduces the concept of noise (also referred to as outliers) for those points that lie in low-density regions. We will formally define both categories in a few lines, but first we have a few more high-level considerations on the algorithm.

Like k-means, DBSCAN is a flat hard-clustering algorithm, meaning that each point is assigned to at most one cluster (or to no cluster, for outliers) with 100% confidence, and that all clusters are objects at the same level: no hierarchy of these groups is kept.

In k-means, random initialization of the centroids has a major role in the algorithm (with good choices speeding up convergence), so much so that often several random restarts of the algorithm are compared before choosing the best clustering. This isn’t true for DBSCAN, where points are cycled through in a somewhat arbitrary order, but this order has little influence, if any, on the final result; therefore, the algorithm can be considered essentially deterministic.

DBSCAN, finally, extends the concept of single-linkage clustering (SLC) by introducing a minimum point density required to consider two points connected to each other. This reduces the single-link chain effect, the worst side effect of SLC, which causes independent clusters connected by a thin line of (noise) points to be mistakenly classified as a single cluster.

1. Directly vs density-reachable

To understand how DBSCAN works, we need to start with a few definitions. Please use figure 12.9 as a reference while going through them to help check your understanding:

  • Point p, in figure 12.9, is said to be a core point because there are at least minPoints points (including p itself) within distance ε from it (where, in the example, minPoints==3).
  • Point q is directly reachable from p because point q is within distance ε from p, which is a core point. A point can only be directly reachable from a core point.
  • A point w is reachable (or, equivalently, density-reachable) from a core point (such as p) through a path of core points p = w₁, …, wₙ = w, if each wᵢ₊₁ is directly reachable from wᵢ. From the previous definition of direct reachability, it follows that all points in the path, except w, need to be core points.
  • Any two points that are density-reachable from each other are, by definition, in the same cluster.
  • If there is any point r that is not reachable from any other point in the dataset, then r (and all the points like r) is marked as an outlier (or, equivalently, as noise).

The algorithm is built around the concept of core points: each of them has at least a certain number of neighbors within a specific distance. This can be seen from a different angle as well: core points are points in areas with at least a minimum density.

Core points (such as p, q, and so on in figure 12.9) that are reachable from each other (that is, adjacent or connected through other core points) belong to the same cluster. Why? Because we conjecture that high-density areas (as opposed to the low-density majority of the domain) define clusters. Moreover, all points that are within distance ε of a core point p belong to p’s cluster too.

Figure 12.9 Core points, directly reachable points, and reachable points, given a radius ε and a threshold minPoints (the minimum number of points in a core region) equal to 3; hence, core points need to have at least two neighbors within distance ε.
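
To make these definitions concrete, here is a minimal Python sketch of the core-point and direct-reachability checks. The helper names region_query, is_core_point, and is_directly_reachable are mine, not the book’s, and points are assumed to be tuples of coordinates with Euclidean distances.

from math import dist  # Python 3.8+: Euclidean distance between two sequences of coordinates

def region_query(points, p, eps):
    """Return all points of the dataset within distance eps of p (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core_point(points, p, eps, min_points):
    """p is a core point if its eps-neighborhood contains at least min_points points."""
    return len(region_query(points, p, eps)) >= min_points

def is_directly_reachable(points, q, p, eps, min_points):
    """q is directly reachable from p only if p is a core point and q lies within eps of p."""
    return is_core_point(points, p, eps, min_points) and dist(p, q) <= eps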

2. From definitions to an algorithm

Moving from the definitions in the previous section to an algorithm is surprisingly simple.

For a given point p, we need to check how many of its neighbors lie within a radius ε. If there are at least minPoints of them (counting p itself), then we mark p as a core point and add its neighbors to the same cluster; otherwise, never mind, we do nothing. Figure 12.10 illustrates this step, which is going to be repeated for every point in the dataset, with the help of a set to keep track of the points that we want to process next (in any order) for the current cluster.
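
In Python, this single step could look something like the sketch below. It reuses the region_query helper from the earlier sketch; the outer loop over the dataset and the handling of noise are left to the complete implementation later on, and the labels dictionary, mapping each point to its cluster id, is an assumption of mine.

def process_point(points, p, eps, min_points, labels, cluster_id, frontier):
    """One iteration of the step in figure 12.10: if p is a core point, assign it to
    the current cluster and schedule its not-yet-assigned neighbors for processing."""
    neighbors = region_query(points, p, eps)   # all points within distance eps of p
    if len(neighbors) < min_points:
        return                                 # p is not a core point: do nothing
    labels[p] = cluster_id                     # p joins the current cluster
    for q in neighbors:
        if q not in labels:                    # not yet assigned to any cluster
            labels[q] = cluster_id
            frontier.add(q)                    # q will be processed next for this cluster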

Now we have to ask ourselves what happens when we process a point w that is not a core point, but is directly reachable from a core point p. Is it okay if we don’t take any action while processing w?

As you can see from figures 12.9 and 12.10, if p is a core point and w is directly reachable from it, then the distance between w and p must be at most ε; therefore, when we check p, we will add all of p’s neighbors within radius ε to the same cluster as p, and hence w will end up in the same cluster as p.

What if there are two core points p and q that are density-reachable from each other, and are both reachable from w? Well, by definition, there will be a chain of core points w₁, …, wₙ between q and p, so in turn each core point in the path will be added to the same cluster as q, and finally so will p as well, when it’s wₙ’s turn.

What if p and q are core points that are not reachable from each other, but are both reachable from a point w? Can w be a core point?

Let’s reason ad absurdum: suppose that w is reachable from p, a core point that is processed before w. Hence, there is a chain of core points p₁, …, pₙ, each reachable from the previous one, that connects p to w.

Suppose also that w, by the time pₙ is processed, has already been added to another cluster, different from p’s. This means that there is a core point q from which w is reachable (and hence there is a chain of core points q₁, …, qₖ, and so on), but p is not reachable from q, because they are in different clusters.

Now, w either is or is not a core point; let’s consider both cases.

If w were a core point, then there would be a chain made of the core points q, q₁, …, qₖ, w, p₁, …, pₙ, p, where all points are reachable from each other; therefore, by definition, p would be reachable from q, and this goes against our initial hypothesis.

It follows that w can’t be a core point. It must be a non-core point reachable from at least two different core points, a situation illustrated in figure 12.11.

In these cases, reachable points can be added to either cluster, and the difference amounts to just a single point. It also means that the two clusters are separated by an area with lower-than-threshold density (although not completely empty).

The final result won’t be influenced by the order in which points are processed, because sooner or later we will discover that all density-reachable points belong to the same cluster. Nevertheless, there is an efficient way and a bad way to process the points. Depending on the order we follow, we might be forced to use different methods to keep track of the clusters.

The bad way is this: If we processed points in a completely random order, then we would need to keep track of the cluster assigned to each point (initially each point being in its own cluster), but we would also need to keep track of which clusters need to be merged (every time we process a core point, we’ll have to try[5] to merge at least minPoints-1 clusters); this becomes complicated to handle and requires an ad hoc data structure, the disjoint set we described in chapter 5.
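
To see why this variant gets complicated, here is a rough sketch of how the core points could be clustered in arbitrary order with a union-find structure, a bare-bones stand-in for the disjoint set of chapter 5. The function names are mine, region_query is the brute-force helper from the earlier sketch, and border points and noise would still require a separate pass.

class DisjointSet:
    """Minimal union-find; the version described in chapter 5 is more complete."""
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def cluster_core_points_any_order(points, eps, min_points):
    """Visit core points in arbitrary order, merging the partitions of core points
    that lie within distance eps of each other."""
    core = [p for p in points if len(region_query(points, p, eps)) >= min_points]
    ds = DisjointSet(core)
    for p in core:                                # any order: each core point can
        for q in region_query(points, p, eps):    # trigger up to minPoints-1 merges
            if q in ds.parent:                    # merge only with core neighbors, so that a
                ds.union(p, q)                    # shared border point can't bridge two clusters
    return ds                                     # each union-find root identifies one cluster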

If, instead, we process the neighbors of each core point p right after finishing with p, as shown in figure 12.10, then we can just build clusters in a sequence, growing each cluster point by point until no further point can be added to it, without any need of merging clusters or keeping track of the history of merges.

By following this order, points like q and p in figure 12.11 will never be added to the same cluster, and it doesn’t really matter to which of them a (so-called) edge point like w is merged. As a matter of fact, edge points are the only points for which DBSCAN is not entirely deterministic, because they can be added to any of the clusters from which they are reachable, and the one cluster to which they are eventually added depends on the order used to process the dataset’s points.

There is one last question we need to ask: How many times do we need to iterate DBSCAN’s main loop? That will be exactly once per point. This is completely different from k-means, where we had many iterations of a few steps on the whole dataset. While k-means is a search heuristic adjusting some parameters to move to a local minimum,[6] DBSCAN is a one-pass deterministic algorithm computing the best partitioning (and at the same time identifying outliers) based on the points’ density in different areas.

3. And finally, an implementation

Now that we have outlined how the algorithm works at a high level, we are ready to write an implementation of DBSCAN, shown in listing 12.6. A Python implementation is also available on the book’s repo on GitHub.
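
Listing 12.6 itself is not reproduced here; the following is my own compact Python sketch of the cluster-growing strategy described in the previous section, with a brute-force region_query standing in for the indexed range queries discussed below. The book’s actual implementation may differ in structure and naming.

from math import dist

NOISE = -1  # label reserved for outliers

def region_query(points, p, eps):
    """Brute-force range query: all points within distance eps of p (p included)."""
    return [q for q in points if dist(p, q) <= eps]

def dbscan(points, eps, min_points):
    """Return a dict mapping each point (a tuple of coordinates) to a cluster id,
    or to NOISE for outliers."""
    labels = {}
    cluster_id = 0
    for p in points:
        if p in labels:
            continue                              # p was already assigned (or marked as noise)
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_points:
            labels[p] = NOISE                     # provisionally noise; may become a border point
            continue
        labels[p] = cluster_id                    # p is a core point: it starts a new cluster
        frontier = set(neighbors) - {p}           # points to process next for this cluster
        while frontier:                           # grow the cluster point by point
            q = frontier.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster_id            # reclaimed as a border (edge) point
                continue                          # q is not a core point: don't expand through it
            if q in labels:
                continue                          # q already belongs to a cluster
            labels[q] = cluster_id
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_points:    # q is a core point too: keep growing
                frontier |= set(q_neighbors)
        cluster_id += 1
    return labels

# Tiny usage example: two dense groups of three points plus one isolated outlier.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
          (5.0, 5.0), (5.1, 5.1), (5.2, 5.0),
          (9.0, 0.0)]
print(dbscan(points, eps=0.5, min_points=3))      # two clusters; (9.0, 0.0) labeled NOISE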

If you recall, we mentioned in chapter 11 that data structures such as k-d trees and SS-trees are often used in clustering. For both k-means and DBSCAN, a multidimensional indexing structure is used to speed up the search for neighboring points. You might have figured this out from the definition of core points, given that to decide whether a point p is a core point, we need to check how many dataset points there are in its neighborhood, within a certain radius from p.

And while for k-means we need to perform nearest neighbor queries, for DBSCAN we will instead run range queries, looking for all the points within a hypersphere.
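
As an illustration, here is a small sketch comparing a brute-force range query with one served by a k-d tree, using SciPy’s cKDTree (one possible choice of spatial index, not necessarily the structure used in the book’s own implementation):

import numpy as np
from scipy.spatial import cKDTree

def range_query_brute(points, i, eps):
    """Brute force: one distance computation per dataset point, O(n) per query."""
    diffs = points - points[i]
    return np.where(np.einsum('ij,ij->i', diffs, diffs) <= eps * eps)[0]

def range_query_kdtree(tree, points, i, eps):
    """With a spatial index, a range query on low-dimensional data is typically much cheaper."""
    return tree.query_ball_point(points[i], r=eps)

rng = np.random.default_rng(42)
points = rng.random((10_000, 2))      # 10,000 random 2-D points in the unit square
tree = cKDTree(points)                # the tree is built once and reused for every query
print(sorted(range_query_brute(points, 0, 0.05)))
print(sorted(range_query_kdtree(tree, points, 0, 0.05)))   # same neighborhood, found faster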

On the book’s repo you can also find a Jupyter Notebook to experiment with the algorithm.

Obviously, for DBSCAN as well as for k-means, it is possible to use brute-force linear search to find each point’s neighborhood. For k-means, however, the speedup is nice but not vital (usually k-means implementations don’t bother with this), while for DBSCAN the performance gain would be dramatic. Can you guess why? Before going further and reading the explanation, try to think about it for a minute.

The difference with DBSCAN is that the search extends over the whole dataset, while for k-means we only look for the closest among k centroids (and usually k << n).

If we have n points in the input dataset, then the difference, in the running time of the whole algorithm, is between O(n²) and O(n·log(n)), assuming that each range query takes O(log(n)). Just as a reminder, with one million points in our dataset, this means going down from ~10¹² (a trillion) operations to ~6×10⁶ (six million).

4. Pros and cons of DBSCAN

We have already mentioned a few characteristics peculiar to DBSCAN that help us overcome some limits of other clustering algorithms like k-means or single-linkage clustering. Let’s quickly review them here:

  • DBSCAN is able to determine the number of clusters (given the hyperparameters with which it is called), while k-means needs this number to be provided as a parameter.
  • It only takes two parameters, which can also be derived from the domain (possibly through a preliminary scan to collect statistics on the dataset).
  • DBSCAN can handle noise in the datasets by identifying outliers.
  • It can find arbitrarily shaped clusters and can partition non-linearly separable clusters (see figure 12.12 and compare it to figure 12.6 for k-means).
  • By tuning the minPoints parameter, it is possible to reduce the single-link effect.
  • The algorithm is almost entirely deterministic, and the order in which points are processed is mostly irrelevant. A different order can only change the assignment of points on the edge of clusters when they are equally close to more than one cluster (as we have seen in figure 12.11).

So much for the good news. As you can imagine, every rose has its thorns, and DBSCAN has some shortcomings as well:

  • DBSCAN is almost entirely deterministic, but not completely. For some applications, it might be a problem if points at the border of two or more clusters are arbitrarily assigned to one or another.
  • DBSCAN also suffers from the curse of dimensionality. If the metric used is the Euclidean distance, then as we saw in section 12.2.2, in high-dimensional spaces all neighbors of a point are at the same distance, and the distance function becomes basically useless. Luckily, other metrics can also be used with DBSCAN.
  • If a dataset has areas with different densities, it becomes challenging, or sometimes impossible, to choose parameters ε and minPoints such that all clusters will be partitioned correctly. Figure 12.13 shows an example illustrating how the result produced by DBSCAN is sensitive to the choice of parameters, and figure 12.14 shows another example with areas of different density that make it impossible to choose a value for ε that will cluster both areas properly.
  • Related to this aspect, one of the problems with successfully running this algorithm is that it can be challenging to find the best values for its parameters when there is no previous knowledge of the dataset.

Hyperparameter tuning, as often happens in machine learning, is crucial and not always easy. Setting minPoints is usually straightforward. As a rule of thumb, for d-dimensional datasets, you choose minPoints > d, and values around 2·d are often ideal. For particularly noisy datasets it is also recommended to use larger values of this parameter, to strengthen the noise filtering. Conversely, determining the right value for ε is often challenging and requires either deep domain knowledge or extensive tuning. Figure 12.13 shows, keeping minPoints fixed, how values of ε that are too small (for a given dataset) cause an excessive fragmentation of the dataset into small clusters, while values that are too large have the opposite effect, reducing the ability of the algorithm to spot different clusters.
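
One simple, if crude, way to get a feeling for this sensitivity is to sweep ε over a range of values and watch how the number of clusters and the amount of noise change. The sketch below does so on made-up data (a dense grid next to a sparser one), reusing the dbscan function and the NOISE label from the earlier sketch.

# A dense 5x5 grid (spacing 0.1) next to a sparser 4x4 grid (spacing 0.5).
points = [(x / 10, y / 10) for x in range(5) for y in range(5)]
points += [(1.0 + x / 2, y / 2) for x in range(4) for y in range(4)]

for eps in (0.05, 0.15, 0.5, 0.7):
    labels = dbscan(points, eps=eps, min_points=4)
    clusters = {c for c in labels.values() if c != NOISE}
    noise = sum(1 for c in labels.values() if c == NOISE)
    print(f"eps={eps}: {len(clusters)} clusters, {noise} noise points")
# Too small an eps marks everything as noise, a mid-range value finds only the dense grid
# or both grids, and too large a value merges the two grids into a single cluster.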

Clearly, the need for such a tuning would bring us back to a situation similar to using k-means, where we needed to know in advance how many clusters we wanted. And while in this two-dimensional example it might seem easy to find what the “right” number of clusters should be, when we move to higher-dimensional spaces, we lose the possibility of using our intuition, and determining the right values of the algorithm hyperparameters based on the number of clusters becomes impossible.

In the next couple of sections, we will examine two different ways to cope with hyperparameters and address their issues.

Source: Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications, 2021.
