Clustering: K-means

Let’s start our discussion with a classic algorithm, one with a history of success going back to the 1950s: k-means, a partitioning algorithm that partitions data into a predetermined number of spherical clusters.

Figure 12.3 illustrates how k-means works. We can break down the algorithm into three high-level steps:

  • Initialization—Create k random centroids, random points that may or may not belong to the dataset and that will become the centers of the spherical clusters.
  • Classification—For each point in the dataset, compute the distance to each centroid, and assign the point to the closest one.
  • Recentering—The points assigned to a centroid form a cluster. For each cluster, compute its center of mass (as described in section 10.3), and then move the cluster’s centroid to that center of mass.
  • Repeat the classification and recentering steps until no point switches to a different cluster during classification, or the maximum number of iterations is reached.

Steps 2-4 of this algorithm are a deterministic heuristic that computes exact distances between points and centroids, and then updates the centroids to the center of mass of each cluster. However, the algorithm qualifies as a Monte Carlo randomized algorithm because of its first step, where we use random initialization. It turns out this step is crucial for the algorithm: the final result is heavily influenced by the initial choice of the centroids. A bad choice can slow down convergence, which, given that the maximum number of iterations is bounded, will likely lead to an early stop and a poor result. And because the algorithm removes centroids to which no points are assigned, a very bad initial choice (for instance, with several centroids close to each other) could lead to an unwanted reduction of the number of centroids in the early stages of the heuristic (see figure 12.4).

To mitigate this issue, in practice k-means is always used with a random-restart strategy: the algorithm is run several times on a dataset, each time with a different random initialization of the centroids, and then results are compared to choose the best clustering (more about this in the last section of the chapter).
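
As a rough illustration (not the book’s code), here is a minimal Python sketch of such a random-restart wrapper. It assumes a k_means function implementing listing 12.1 that returns the centroids and the points’ cluster assignments, and it scores each run with the total squared distance of the points from their assigned centroids, which is just one possible choice of quality metric.

```python
def total_squared_distance(points, centroids, labels):
    # Sum of squared distances between each point and the centroid it is assigned to.
    return sum(sum((x - c) ** 2 for x, c in zip(point, centroids[label]))
               for point, label in zip(points, labels))

def k_means_with_restarts(points, k, restarts=10, max_iterations=100):
    # Run k-means several times, each with a different random initialization,
    # and keep the clustering that scores best.
    best_score, best_result = float("inf"), None
    for _ in range(restarts):
        centroids, labels = k_means(points, k, max_iterations)  # as in listing 12.1 (hypothetical name)
        score = total_squared_distance(points, centroids, labels)
        if score < best_score:
            best_score, best_result = score, (centroids, labels)
    return best_result
```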

Listing 12.1 shows the code for the main method performing k-means clustering. We broke down the main steps into separate functions to get cleaner, more easily maintainable code, and we’ll delve into each step in the following pages. You can also check out implementations of the method on the book’s repo on GitHub.
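
The listing itself is not reproduced in this excerpt (the line numbers mentioned below refer to the book’s pseudo-code), but here is a minimal Python sketch of the same flow, with illustrative helper names that match the steps described above:

```python
def k_means(points, k, max_iterations):
    # Step 1, initialization: k randomly perturbed points drawn from the dataset (listing 12.2).
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 2, classification: assign each point to its closest centroid (listing 12.3).
        new_labels = classify_points(points, centroids)
        # Stop condition: if no point changed cluster, the centroids won't move either.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3, recentering: move each centroid to its cluster's center of mass (listing 12.4).
        centroids = update_centroids(points, labels, len(centroids))
    return centroids, labels
```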

The algorithm can be seen as a search heuristic converging to a (local) optimum, with functions to compute the score and the gradient step that are slightly more complex than usual. At line #6, we have a stop condition that checks whether the algorithm has converged: if the classification hasn’t changed in the last step, then the centroids will be the same as in the previous step, so any further iteration would be futile. Since these functions are quite expensive to compute at each step, and convergence is not guaranteed, we add another stop condition by capping the execution to a maximum number of iterations. As mentioned, we abstracted away the logic of the update and score functions into separate methods. But before looking at those, it’s interesting to check out the random initialization step, which is often underestimated. Listing 12.2 shows the implementation we suggest for this function.

While there are several viable alternatives (for instance, randomly drawing each coordinate within the domain’s boundaries, or using actual points from the dataset), a solution that offers several advantages is randomly perturbing k points drawn at random from the dataset:

  • First, we don’t have to worry about the domain’s boundaries. If each centroid’s coordinate were generated completely at random, we would first have to scan the dataset to find the acceptable range for each coordinate.
  • Even paying attention to the dataset’s boundaries, sampled points could end up in sparse or empty regions, and as such they could later be removed. Instead, by uniformly drawing points from the dataset, the centroids will be close to points in the data and . . .
  • . . . centroids will be drawn with higher probability in areas with higher density of points.
  • Randomly perturbing the points, however, helps reduce the risk that all the centroids will be concentrated in the denser areas.

The code in listing 12.2 is also intuitive, though there are a couple of interesting points to raise.

At line #2 we use a generic sampling function that draws n elements from a set without repetition; the details of this method are not important here, and you don’t need to worry about them. Most programming languages will provide such a function in their core libraries,[2] so it’s unlikely you will have to implement it.[3]
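
A minimal Python sketch of this idea follows; random.sample plays the role of the generic no-repetition sampling function mentioned above, while the Gaussian perturbation and its scale are illustrative choices, not prescribed by the book.

```python
import random

def init_centroids(points, k):
    # Draw k distinct points from the dataset (sampling without repetition)...
    sampled = random.sample(points, k)
    # ...then randomly perturb each coordinate, so centroids don't coincide exactly
    # with dataset points and are less likely to all sit in the densest areas.
    return [[x + random.gauss(0, 0.1) for x in point] for point in sampled]
```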

We can move to the classification step, described in listing 12.3. Method classifyPoints is a brute-force search over all the point-centroid pairs, whose goal is finding the closest centroid to each point. If you’ve read the book thus far, you should by now feel goose bumps when you hear “brute-force” and, as a conditioned reflex, think, can we do any better? We’ll talk about this question a couple of sections down the road.
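
A minimal Python sketch of such a brute-force classification (illustrative, matching the helper used in the earlier sketch):

```python
def squared_distance(p, q):
    # Squared Euclidean distance: the square root isn't needed to compare distances.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_points(points, centroids):
    # Brute force: for each point, compute the distance to every centroid
    # and keep the index of the closest one.
    return [min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))
            for point in points]
```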

For the moment, let’s see the last helper method we need to implement to complete the k-means algorithm, the one updating centroids, described in listing 12.4.

To update centroids, as we’ve mentioned, we will just compute the center of mass for each cluster. This means that we need to group all points by the centroid they are assigned to, and then for each group, compute the mean of the points’ coordinates.

This pseudo-code implementation doesn’t apply any correction to remove centroids that have no points assigned to them—they are just ignored. The initialization of the centroids array is also not tackled; the array is simply initialized as empty, assuming it will be resized dynamically while new elements are added to it. Of course, in a real implementation, both issues should be taken care of, according to what each language allows for initialization and resizing of arrays.
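
A minimal Python sketch of the update step, consistent with the helpers above; like the pseudo-code, it simply drops clusters that end up empty, which a real implementation should handle explicitly:

```python
def update_centroids(points, labels, k):
    # Group the points by the centroid they are assigned to, accumulating
    # coordinate sums and counts, then take the mean of each coordinate
    # (the cluster's center of mass).
    d = len(points[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for point, label in zip(points, labels):
        counts[label] += 1
        for j, x in enumerate(point):
            sums[label][j] += x
    # Centroids with no points assigned are dropped here; a real implementation
    # should handle them explicitly (for example, by re-sampling a new centroid).
    return [[s / counts[i] for s in sums[i]] for i in range(k) if counts[i] > 0]
```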

If you’d like to take a look at real implementations of k-means, you can find a Python version on the book’s repo on GitHub. At https://github.com/andreaferretti/kmeans you can also find a nice resource (unrelated to the book) that, like a Rosetta Stone, contains implementations of this algorithm in many programming languages (you can find a version for most of the mainstream languages).

Now that we have described how k-means is implemented, let’s take another look at how it works.

1. Issues with k-means

Figure 12.5 shows the result of applying k-means to an artificial dataset. The dataset has been carefully crafted to represent the ideal situation where k-means excels: the clusters are easily (and linearly) separable, and they are all approximately spherical.

First, let’s start with a positive note about k-means. Figure 12.5 shows how it manages to correctly identify clusters with different densities; for instance, the two clusters in the bottom-left corner have a larger average distance between points (and hence a lower density) than the cluster in the top-right corner. This might seem like something to take for granted, but not all clustering algorithms deal so well with heterogeneous distributions. In the next section, we’ll see why this is a problem for DBSCAN.

That’s all for the good news: you can also see that there are a few points that are not close to any of the spherical clusters. We added some noise to the dataset as well, to show one of k-means’ critical issues: it can’t detect outliers, and in fact, as shown in figure 12.5, outlier points are added to the closest clusters. Since centroids are computed as centers of mass of the clusters, and the mean function is sensitive to outliers, if we don’t filter out outliers before running k-means, the centroids of the clusters will be “attracted” by the outliers away from the best position they could hold. You can see this phenomenon in several clusters in figure 12.5.

The issue with outliers, however, is not the worst problem with k-means. As mentioned, this algorithm can only produce spherical clusters, but unfortunately, in real datasets not all clusters are spherical! Figure 12.6 shows three examples where k-means clustering fails to recognize the best clustering.

Non-linearly-separable clusters can’t be approximated with spherical clusters, and as such, k-means can’t separate non-convex clusters, such as clusters shaped like two concentric rings. More generally, in all those situations where the clusters’ shapes are not spherical and the points can’t be separated correctly using the minimum distance from a centroid, k-means will struggle to find good solutions.

Another issue with k-means is that the number of clusters is a hyper-parameter, meaning that the algorithm is not able to determine the right number of clusters automatically,[5] and instead takes the number of centroids as an argument. Unless we have some domain knowledge suggesting the right number of categories into which we should cluster the dataset, to find the right number of clusters we will need to run the algorithm multiple times, trying different values for the number of centroids, and compare the results using some kind of metric (visual comparison is only possible for 2D and 3D datasets). We’ll talk about this in section 12.5.

These issues are certainly limiting, although there are domains in which we can assume or even prove that data can be modeled well with spherical clusters. But even in the best-case scenario, or in intermediate situations where a spherical cluster is still a good approximation, there is another issue that can limit the application of this clustering algorithm: the curse of dimensionality.

2. The curse of dimensionality strikes again

We already bumped into the curse of dimensionality when we described k-d trees in chapter 9: a data structure that works well for low-to-medium-dimensional spaces but behaves poorly in high-dimensional spaces.

It’s not a coincidence that we find this issue again for k-means. This algorithm is a search heuristic that minimizes the Euclidean distance of points to their cluster’s centroid. In high dimensions, however:

  • The ratio between volume and surface grows exponentially.
  • Most points of a uniformly distributed dataset are not close to the center of mass, but rather far away toward the surface of the cluster.
  • If data is uniformly distributed in a hypercube (a domain where each feature is uniformly distributed in a fixed range), then in high dimensions most points are close to the faces of the hypercube.
  • Approximating a hypercube with its inscribed hypersphere will leave out most of the points.
  • To include all the points, we need the hypersphere circumscribed around the hypercube, and as we saw in section 10.5.1, the amount of volume wasted grows exponentially with the number of dimensions.
  • In a d-dimensional space, with d >> 10, under certain reasonable assumptions for the data distribution, the nearest neighbor problem becomes ill-defined,[6] because in high dimensions the ratio between the distance from a target to its nearest and farthest neighbors approaches 1. For instance, if points are equally spaced (for example, placed on a grid), then the 2*d closest points to any centroid are all at the same distance from it. This means that the nearest neighbor of a point becomes ambiguously defined (the short simulation after this list gives a feel for this effect).
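
To get a feel for that last point, here is a small, purely illustrative numpy experiment (not from the book): for uniformly distributed points, the ratio between the nearest and the farthest distance from a random query point drifts toward 1 as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
for d in (2, 10, 100, 1000):
    points = rng.random((n, d))   # n points uniformly distributed in the unit hypercube
    query = rng.random(d)         # a random query point in the same hypercube
    distances = np.linalg.norm(points - query, axis=1)
    print(d, distances.min() / distances.max())
# The printed ratio gets closer and closer to 1 as d grows,
# making "the nearest centroid" a less and less meaningful notion.
```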

In simple terms, for higher-dimensional datasets, unless the distribution of clusters is exactly spherical, the spheres needed to include all points are so large that they will likely overlap each other for a significant portion of their volume; moreover, for close-to-uniformly distributed datasets, and when many centroids are used, the search for the nearest centroid can be inaccurate. This, in turn, leads to slower convergence, because some points can be assigned back and forth to different, almost equally close centroids.

In summary, we need to keep in mind that k-means is a good option only for low-to-medium-dimensional datasets (with at most around 20 dimensions) where we know that clusters can be accurately approximated with hyperspheres.

3. K-means performance analysis

When we do know that a dataset meets the pre-conditions to apply k-means, however, this algorithm is a viable option and produces good results quickly. How quickly? That’s what we are going to ascertain in this section.

Let’s assume we have a dataset with n points, each point belonging to a d-dimensional space (hence each point can be represented as a tuple of d real numbers), and that we would like to partition the points into k different clusters.

If we examine each sub-step in listing 12.1, we can conclude that

  • The random initialization step takes O(n*d) assignments (d coordinates for each of the n points).
  • Initializing cluster indices takes O(n) assignments.
  • The main cycle is repeated m times, where m is the maximum number of iterations allowed.
  • Assigning points to centroids takes O(k*n*d) operations, since for each point, we need to compute k d-dimensional (squared) distances, and each requires O(d) operations to be computed.
  • Comparing two classifications of the points requires O(n) time, but with a careful implementation this can be amortized inside the method doing the partitioning.
  • Updating the centroids requires computing, for each centroid, d means (one for each coordinate[7]) over at most n points, and so a total of O(k*n*d) operations. If we can assume that the points are distributed evenly among clusters, each cluster will contain approximately n/k points, and therefore the running time for this step becomes O(k*(n/k)*d) = O(n*d).

The running time for the algorithm is therefore O(m*k*n*d), with O(n+k) extra memory needed to store the points’ classification and the list of centroids.
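
Putting the bounds above together (using the O(k*n*d) bound for the recentering step, without the even-distribution assumption), the total cost can be written as:

```latex
% initialization + m iterations of (classification + comparison + recentering)
O(nd) + O(n) + m \cdot \big( O(knd) + O(n) + O(knd) \big) = O(m \cdot k \cdot n \cdot d)
```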

4. Boosting k-means with k-d trees

When we described the code for k-means, we saw that the partitioning step, where we assign each point to exactly one of the centroids, is a brute-force search among all combinations of points and centroids. Now we wonder, can we speed it up in any way?

In section 10.5.3 we saw how k-means can help SS+-trees’ performance by making them more balanced when used in split heuristics. But is it possible that the opposite is also true? You might see where I’m going with this: Is it possible to replace brute-force search with something more efficient?

If you think about it, for each point we are looking for its nearest neighbor among the set of centroids, and we already know a data structure or two to speed up this search!

But rather than SS+-trees, in this case the context suggests we should try k-d trees, for three reasons. Before reading them, pause for a minute and think about why we might prefer k-d trees. If you can’t think of all three reasons, or if you don’t have a completely clear idea about why they hold, you can look at sections 9.4 and 10.1, which explain these concepts in more detail.

The reasons why k-d trees would be better suited than SS+-trees for the nearest centroid search are

  • The size of the dataset to be indexed (the set of centroids) is small; hence there is a very high probability that it will fit into memory. In this case, k-d trees are the best choice, provided the other two conditions also hold.
  • The dimension of the search space is low to medium. We know this is the case (if we have done our homework!) because k-means also suffers from the curse of dimensionality and shouldn’t be applied to high-dimensional datasets.
  • K-d trees can offer a theoretical worst-case upper bound that is better than brute-force search: O(2^d + d*log(k)) per query over k d-dimensional centroids. SS+-trees, however, can’t offer a better-than-linear worst-case running time.

Moreover, the data structure used will have to be recreated from scratch at each iteration of k-means’ main cycle, so we won’t be dealing with a dynamic dataset that would cause a k-d tree to become imbalanced over time.

The fact that we have to create a new data structure at each iteration is, at the same time, one of the biggest cons of this approach, because we will have to pay this extra price every time. Also, we will need O(k) extra memory to store it. This won’t change the asymptotic memory footprint of the algorithm, but in practice it will be relevant, especially if k, the number of centroids, is high.

In most applications, though, it is reasonable to expect to have k << n. In other words, the number of centroids will be several orders of magnitude smaller than the number of points.

Listing 12.5 shows how we can change the pseudo-code for method classifyPoints to use a k-d tree in place of brute-force search.

As you can see, the code is much shorter than the version in listing 12.3, because most of the complexity of the search is now encapsulated in the KdTree class.

Notice that since we would like to find the index of the centroid closest to each point, we assume that the KdTree object keeps track of the indices of its points in the initialization array, and that we have a query method that returns the index of the closest point, rather than the point itself. It’s not hard to find a workaround if this is not the case: we can keep a hash table associating centroids with their indices and add an extra step to retrieve the index of the centroid returned by the KdTree.
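
As an illustration of the shape of that code (this is not the book’s listing 12.5; the KdTree class and its nearest_index method are assumptions matching the description above):

```python
def classify_points_with_kdtree(points, centroids):
    # Build a k-d tree over the (few) centroids once per iteration of k-means...
    tree = KdTree(centroids)              # hypothetical class, as assumed in the text
    # ...then answer one nearest-neighbor query per point, getting back the
    # index of the closest centroid rather than the centroid itself.
    return [tree.nearest_index(point) for point in points]
```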

Performance-wise, creating a k-d tree for the centroids requires O(k*log(k)) steps, each of which can require up to O(d) operations (because we are dealing with d-dimensional points); the whole classification step will therefore require O(d*k*log(k) + n*2^d + n*d*log(k)) = O(n*2^d + d*(n+k)*log(k)) steps, instead of O(n*k*d) operations.

Formally, we can work out the exact conditions under which O(n*2^d + d*(n+k)*log(k)) < O(n*k*d); for our purposes, however, we can just use some informal intuition to see that

  • n*2^d < n*k*d ⟺ 2^d < k*d, which informally requires the dimension d to be small compared to the number of centroids k (d << k).
  • d*(n+k)*log(k) < n*k*d ⟺ (n+k)*log(k) < n*k ⟺ n < n*k/log(k) - k. This can be shown to hold for n > k, but plotting the difference between the two sides shows that for a fixed n, the difference grows as k grows, so the savings are more noticeable when there are many centroids.

The whole algorithm, assuming that clusters are evenly distributed with approximately n/k points each, would then run in O(m*(n*2^d + d*(n+k)*log(k) + n*d)) = O(m*(n*2^d + d*(n+k)*log(k))), and so, in theory, net of constant multipliers and implementation details, it seems that it could be a good idea to use a data structure such as k-d trees to speed up the nearest neighbor search.

As we have seen, however, when the theoretical margin is small, a more complex implementation sometimes only beats asymptotically worse solutions for large, or at times very large, inputs. To double-check whether, in practice, it is worth going through the trouble of creating and searching a k-d tree at each iteration, we ran some profiling that you can check out on the book’s repo.[8]

Once again, we used a Python implementation to do the comparison. By now it should go without saying, but these results are obviously only significant for this language and this particular implementation. Nevertheless, they do prove something.

For k-d trees, we used SciPy’s implementation, provided in the module scipy.spatial. If you have read this book’s earlier chapters, you probably remember one of the golden rules we mentioned: before implementing something yourself from scratch, check whether there is something trustworthy already available. In this case, SciPy’s implementation is not only likely more reliable and efficient than a version we could write ourselves (since SciPy’s code has already been tested and tuned at length), but its query method, performing NN search, also returns the index (relative to the order of insertion) of the point found. That’s exactly what we need for our k-means method, and it saves us from storing extra data to map points to indices.
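
For reference, here is a minimal sketch of how that could look with SciPy (the function name is ours; KDTree.query returns both distances and indices):

```python
import numpy as np
from scipy.spatial import KDTree

def classify_points_scipy(points, centroids):
    # Build the k-d tree over the centroids, then query the nearest centroid for every point.
    tree = KDTree(np.asarray(centroids))
    _, indices = tree.query(np.asarray(points))  # index of the closest centroid for each point
    return indices
```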

Figure 12.7 shows the profiling results for the Python method cluster_points, implemented both with brute-force search and with k-d tree nearest neighbor search.

If you look closely at the four charts in figure 12.7, you can see how the line for the k-d tree implementation is more stable through the various values of k, while the slope of the upper line for the brute-force search algorithm becomes steeper as k grows.

This trend is even more apparent if we look at the data from a different angle, by keeping n fixed and plotting the running time as a function of k, as we do in figure 12.8.

So we can say that, at least in Python, implementing the point-partitioning subroutine using k-d trees provides an advantage over the naive brute-force search.

5. Final remarks on k-means

To conclude this section, let’s summarize what we learned about k-means:

  • K-means is a centroid-based hard-clustering method.
  • If implemented with an auxiliary k-d tree, the running time for the algorithm is O(m*(n*2^d + d*(n+k)*log(k))).
  • K-means works well on low-to-medium-dimensional data, when cluster shapes are spherical and the number of clusters can be estimated a priori; it works well even if the dataset doesn’t have a homogeneous distribution.
  • K-means works poorly on high-dimensional data, and when clusters can’t be approximated with hyperspheres.

Now that we have described the progenitor of all clustering algorithms, we need to fast-forward 40 years to the invention of the next algorithm we are going to present.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
