Evaluating clustering results: Evaluation metrics

At this point in the chapter, we have learned about three different (and increasingly complex and effective) clustering algorithms, their strengths, and their weaknesses. Despite being so different from one another, there is one thing that all of these algorithms have in common: they require a human to set one or more hyper-parameters in order to achieve the best result.

Setting these parameters manually can be extremely challenging: with 2-D datasets it's possible to look at the resulting clustering and decide whether it looks good, but with higher-dimensional datasets we don't have the luxury of relying on our intuition.

It is time to define a more formal way to assess the quality of a clustering. It’s finally time to talk about evaluation metrics.

To best develop this discussion, we'll revive our initial example: customer segmentation for our e-commerce site, based on two features, annual income and average monthly bill on the platform. Figure 12.24 shows a synthetic dataset with a realistic distribution (based on data from a real website). The dataset has been preprocessed by normalizing the points. This kind of data massaging is a standard step in data science; it helps ensure that features measured on larger scales don't outweigh the others in the final decision. For instance, in our example, we can reasonably assume that the annual salary is in the range $100K-200K, while the average monthly expenses on the website are somewhere between $500 and $1,000. If that were the case, one feature would have values a couple of orders of magnitude larger than the other, and the dataset would be completely skewed. In other words, if we used Euclidean distance to measure how close two points are, the difference in annual salary would contribute overwhelmingly more than the difference in monthly expenses to the final distance between two customers, making the second feature irrelevant.

To avoid this effect, we perform a normalization step: for each feature, we subtract its mean from every point and then divide by that feature's standard deviation.
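As a concrete illustration, here is a minimal NumPy sketch of this normalization step. The `customers` array and its values are made up for illustration; the book's Notebook may preprocess the data differently.

```python
import numpy as np

# Hypothetical toy dataset (values invented for illustration):
# column 0 = annual income in dollars, column 1 = average monthly bill in dollars.
customers = np.array([
    [120_000.0, 650.0],
    [180_000.0, 900.0],
    [105_000.0, 520.0],
    [160_000.0, 980.0],
])

# Z-score normalization: subtract each feature's mean from every point,
# then divide by that feature's standard deviation.
normalized = (customers - customers.mean(axis=0)) / customers.std(axis=0)
```

After this step both features have zero mean and unit variance, so neither dominates the Euclidean distance between two customers.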

On the left side of figure 12.24, we show a possible clustering for the dataset. You don’t need to be a domain expert to see that this appears to be a bad choice for the number of clusters.

Can we write a method or a mathematical formula to express the (poor) quality of this choice?

Let's start from a consideration: What are the consequences of a bad clustering? Looking at the picture, we can see that points that should belong to different clusters (for example, the four clusters in the top-right quadrant) are instead grouped together. As a result, points that are far away from each other all end up in the same cluster. And if we think about the very definition of k-means, its target is minimizing the squared Euclidean distance of the points to their centroids. When too few centroids are created (as in the example), many points are forced to be assigned to a single centroid, even if they are far away from it. Therefore, the average distance of points in each cluster from their centroid will be higher than if we chose the right number of centroids!
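For reference, the objective that k-means minimizes (the within-cluster sum of squared distances; this formula isn't spelled out in the text, but it is the standard one) can be written as

$$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

where $C_j$ is the set of points assigned to centroid $\mu_j$. Choosing too few centroids inflates this sum, because distant points are forced to share a centroid.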

If we run the algorithm several times, with different values for the number of centroids, we can compute the average (or median) distance of points from their centroids for each value, and choose the value of k that yields the lowest distance. Is that good enough?
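One way to compute this metric is sketched below, using scikit-learn's KMeans for convenience; the function name and the choice of library are mine, not the book's.

```python
import numpy as np
from sklearn.cluster import KMeans

def avg_distance_to_centroid(points, k, seed=0):
    # Fit k-means with k centroids, then average the Euclidean distance
    # of every point to the centroid of the cluster it was assigned to.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
    distances = np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1)
    return distances.mean()
```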

Well, we are close, but not quite there. If you think about it, whatever the value of k (the number of centroids) that we tested last, if we choose k+1, with one more centroid the average distance will drop a little further, because each centroid will have fewer and closer points in its cluster. If we take this to an extreme, by choosing k=n (one centroid per point in the dataset) we will obtain an average distance-to-centroid equal to zero (because each point will have its own centroid).

How can we counterbalance this? The truth is that we can't really balance it easily, but there is a simple empirical method that helps us choose the best value. We'll see that in a minute.

First, let's make another important point clear: the distance of points from centroids is only one of many possible metrics. Incidentally, this metric only works for centroid-based clustering algorithms; it wouldn't be applicable to OPTICS or DBSCAN. A similar metric that would also work for those algorithms is the intra-cluster distance: the average pair-wise distance between points in the same cluster. Another interesting measure, useful when dealing with categorical features, is the total cohesion (similar to the distance to cluster center, but using the cosine distance instead of the Euclidean distance).
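Because it doesn't rely on centroids, the intra-cluster distance can be computed for the output of any of these algorithms. Here is a small sketch; the function name is mine, and `labels` is whatever cluster assignment the algorithm produced (with noise points, if any, filtered out beforehand).

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_intra_cluster_distance(points, labels):
    # For each cluster, compute the average pairwise Euclidean distance
    # between its points, then average over all clusters with 2+ points.
    per_cluster = []
    for label in np.unique(labels):
        members = points[labels == label]
        if len(members) > 1:
            per_cluster.append(pdist(members).mean())
    return float(np.mean(per_cluster))
```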

In the rest of this section, we'll adopt the intra-cluster average distance as our metric. Now, however, it's time to reveal the elbow method, an empirical tool that's used in clustering, as well as across machine learning, to spot the best value for hyper-parameters.

Figure 12.25 shows an example of applying the elbow method to our customers dataset to determine the best number of clusters based on intra-cluster average distance. But it would also be possible to use the same method to decide on the best value for ε or minPoints if we were using DBSCAN instead.

From figure 12.25 you can guess where the name of this method comes from: the plot looks like a bent arm, and we want to choose the value corresponding to the elbow, the point where the slope of the curve changes drastically.

For our example, that value is k=6. After that point, the metric improves very little. For real datasets, the transition can be less neat (in this case, the clusters are very compact and far from each other, so once we reach the optimal number of clusters, the improvement from adding another centroid is almost null), but there is often a point that acts like a watershed: to its left, the slope of the curve is close to -45° or even steeper; to its right, it's closer to 0°.

There are, of course, a few details to take into account to successfully implement this method. First, since k-means is a randomized method, it's important to run it several times for each value of k. Then you can pick the best value among all the runs (and also store the clustering produced, as we do in the Notebook on the book's repo), or the average or median value for the metric, depending on your goal.[2] Moreover, you need to carefully choose the best metric for your problem. You might want to minimize the distance of points inside one cluster, to make sure clusters are as homogeneous as possible, or, for instance, maximize the distance between different clusters, to make sure you have a neat separation between different groups.

Listing 12.10 summarizes the steps that should be performed to successfully apply the elbow method (up to the plotting, which is not included for obvious reasons). As mentioned, you can check it out in the Notebook on the book's repo.
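The actual Listing 12.10 isn't reproduced here, but the steps it describes can be sketched roughly as follows, using scikit-learn's KMeans and the average distance-to-centroid as a stand-in metric (names and parameters are illustrative, not the book's; swapping in the intra-cluster distance sketched earlier is straightforward).

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(points, k_values, runs_per_k=10):
    # For each candidate k, run k-means several times with different random seeds,
    # score every run by the average distance of points to their assigned centroid,
    # and keep the best (lowest) score found for that k.
    best_scores = []
    for k in k_values:
        scores = []
        for seed in range(runs_per_k):
            km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(points)
            d = np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1)
            scores.append(d.mean())
        best_scores.append(min(scores))
    return best_scores

# Usage sketch (assuming `normalized` is the preprocessed dataset from earlier):
#   scores = elbow_curve(normalized, range(2, 13))
#   Plot k against scores and look for the elbow, as in figure 12.25.
```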

Interpreting the results

To check whether the elbow method works, we have stored the results producing the best metric values for each value of k tested (the ones that are plotted in figure 12.25), and we can take a look at some of them in figure 12.26 to verify that our choice makes sense. And indeed, with k=5 we would have too few centroids, and two clusters would be assigned to the same centroid (left chart, bottom-left corner), while with k=7, a single "natural" cluster gets split between two centroids (right chart, middle).

Once we have established that six clusters are the best choice for our dataset, we can try to interpret the result of the clustering. The top-right corner, for instance, is made of customers with high annual income who also spend generously on the website. Close to this group, there is another interesting cluster: people who, despite a lower income, spend almost as much in monthly purchases as the wealthiest cluster. In the lower-left corner, we can see two clusters of people with low incomes who also don't spend a lot of money on e-commerce. These two groups can either be marketed to together, or further analysis can be performed to understand their differences and target products more appropriately to each group. Considering that they bring in limited revenue, though, the marketing department might rather ask the data science team to focus on the two clusters toward the center of the chart: middle-class customers who could be encouraged with targeted campaigns, and who can be asked to fill out surveys that will help the marketing team improve customer satisfaction.

We have seen just one example of clustering in action and of how it can help your company thrive, but there are many, many more. Now it's your turn to apply these powerful techniques to your data!

Source: Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications, 2021.
