Intro to clustering

In recent years, especially from the second half of the first decade of this century, a single branch of AI gained so much momentum that the media and public opinion now often treat it as synonymous with AI. I’m talking, of course, about machine learning (ML), which in turn has increasingly been identified (since roughly 2015) with deep learning (DL).

The truth is that deep learning constitutes just a part of machine learning, gathering all those models built with deep (that is, “with many layers”) neural networks, and machine learning, in turn, is just a branch of artificial intelligence.

In particular, machine learning is the branch that focuses on developing mathematical models that describe a system after learning its characteristics from data.

ML and DL can achieve impressive, eye-catching, and sometimes incredible results (at the time of writing you can think of, for instance, lifelike artificially generated faces, landscapes, and even movies created by GANs), but they can’t, and don’t aim to, build an “intelligent” agent—something closer to the romantic idea of an artificial consciousness that cult movies like Short Circuit or WarGames gave us. That is rather the goal of artificial general intelligence.

1. Types of learning

The main classification of machine learning models is based on the type of “learning” they perform, in particular focusing on the way we provide feedback to the model during training:

  • Supervised learning—These models are trained on labeled data; that is, for each entry in the training set, there is a label (for classification algorithms) or a value (for regression algorithms) associated with that entry. Training tunes the model’s parameters to increase its accuracy in assigning the correct class or value to new entries. Examples of supervised learning are object detection (classification) or predictive models that estimate goods’ prices (regression).
  • Reinforcement learning—Rather than providing explicit labels associated with the data, the model performs some kind of task, and only at the end does it receive feedback about the outcome (stating either success or failure). This is one of the most interesting areas of research at the time of writing, and some examples include game theory (for instance, an agent learning to play chess or Go) and many areas of robotics (such as teaching a robotic arm to juggle).
  • Unsupervised learning—This category differs from the first two because here the algorithms receive no feedback on the data; their goal is rather to make sense of the data by extracting its inner (and often hidden) structure. Clustering is the main form of unsupervised learning (a short code sketch contrasting the supervised and unsupervised workflows follows this list).
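
To make the distinction concrete, here is a minimal sketch, assuming scikit-learn and a toy dataset (both illustrative choices): the supervised model is trained on points together with their labels, while the clustering algorithm only ever sees the points.

```python
# Minimal sketch contrasting supervised and unsupervised learning.
# Assumes scikit-learn; dataset and model choices are illustrative.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: the model sees both the points X and their labels y.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted class of first point:", classifier.predict(X[:1]))

# Unsupervised: the model sees only X and must infer structure on its own.
clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assigned to first point:", clustering.labels_[0])
```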

We will obviously focus on unsupervised learning in the rest of the chapter. Clustering algorithms, in fact, take an unlabeled dataset and try to gather as much information as possible about its structure, grouping together similar data points while setting apart dissimilar ones.

Although at first it might seem less intuitive than supervised or reinforcement learning, clustering has several natural applications, and possibly the best way to describe clustering is by exemplifying a few of them:

  • Market segmentation—Take purchase data and find groups of similar customers (a code sketch after this list shows what this might look like). Since they behave similarly, a marketing strategy is likely to work (or fail) consistently across a group (if segmentation is done properly).

An algorithm won’t output labels for the groups; it won’t tell us if a group is “students under 25” and another “middle-aged book writers with a passion for comics.” It will only gather similar people together, and then data analysts can further examine the groups to understand more about their composition.

  • Finding outliers—Find data that stands out. Depending on the context, it could be noise in a signal, a new species of flowers in a rain forest, or even a new pattern or behavior in customer analytics.
  • Preprocessing—Find clusters in the data and process each cluster separately (and possibly in parallel). This obviously provides a speedup, but it also has another benefit: it reduces the maximum amount of memory needed at any single time, because when a huge dataset is broken up into smaller pieces, each piece can fit in memory or be processed on a single machine, even when the whole dataset can’t. Sometimes you can even use a fast clustering algorithm (for instance, canopy clustering) as a preprocessing step for a slower clustering algorithm.
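
As a hypothetical illustration of the market segmentation scenario above, the following sketch clusters a handful of made-up customers by purchase behavior. The feature columns, the values, the number of segments, and the use of k-means (introduced later in this chapter) are assumptions made only for this example.

```python
# Hypothetical market-segmentation sketch: group customers by purchase behavior.
# Feature columns, values, and the number of segments are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row is a customer: [orders per year, average order value, days since last order]
purchases = np.array([
    [52,  18.0,   3],
    [48,  22.5,   7],
    [ 4, 310.0,  40],
    [ 6, 280.0,  25],
    [12,  45.0,  90],
    [10,  50.0, 120],
])

# The features live on very different scales, so standardize before clustering.
scaled = StandardScaler().fit_transform(purchases)

segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(segments)  # one group index per customer, e.g. [0 0 1 1 2 2]
```

Note that the output is just a group index per customer; as discussed above, giving each segment a meaningful name is still the analyst’s job.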

2. Types of clustering

Clustering is an NP-hard problem, and as such it is computationally intractable (impossible, in practice, for today’s real datasets) to solve exactly. Moreover, it is hard even to define objective metrics that can assess the quality of a solution! Figure 12.1 illustrates this concept. For some cluster shapes, our intuition tells us that the two rings should be in different clusters, but it’s hard to come up with a metric function that objectively states so (minimum distance, for instance, wouldn’t work). And this becomes even harder for high-dimensional datasets (where our intuition can’t even help us validate these metrics, because it’s hard to represent and interpret anything beyond 3-D space).
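
The following small sketch, whose library calls and parameters are chosen purely for illustration, reproduces the two-rings situation in code: a centroid-based method such as k-means cuts across both rings, while a density-based method such as DBSCAN (both are discussed later in this chapter) keeps each ring together.

```python
# Two concentric rings: a case where a minimum-distance/centroid criterion fails.
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN

X, ring = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# The outer ring (ring == 0) gets split across both k-means clusters,
# while DBSCAN keeps it (noise points aside) in a single dense cluster.
print("k-means labels on the outer ring:", set(kmeans_labels[ring == 0]))
print("DBSCAN labels on the outer ring: ", set(dbscan_labels[ring == 0]))
```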

For these reasons, all clustering algorithms are heuristics that converge more or less quickly to a locally optimal solution.
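
One quick way to observe this, sketched here under the assumption that scikit-learn’s KMeans is run with a single random initialization per seed, is to compare the quality of the solutions reached from different starting points.

```python
# Clustering heuristics converge to local optima: different random starts
# can yield solutions of different quality. Dataset and parameters are illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=6, cluster_std=2.0, random_state=42)

for seed in range(5):
    km = KMeans(n_clusters=6, n_init=1, random_state=seed).fit(X)
    # inertia_ is the within-cluster sum of squared distances: lower is better.
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
# The inertia typically varies across seeds: each run is only locally optimal.
```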

Under the category labeled “clustering,” we group several data-partitioning algorithms, using approaches completely different from each other. All these approaches can be applied to the problems described in the previous section almost transparently, although, obviously, each approach has strengths and flaws, and we should select the best-fitting algorithms based on our requirements.

A first relevant distinction is made between hard and soft clustering, as also shown in figure 12.2:

  • In hard clustering, the output assigns each point to one and only one cluster (or at most one, if the clustering algorithm can also detect noise).
  • In soft clustering, for each point P and each group G, the output provides a probability that P belongs to G (the sketch after this list illustrates the difference).
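
Here is a minimal sketch of the difference, assuming scikit-learn, with k-means standing in for a hard-clustering algorithm and a Gaussian mixture model for a soft one (both choices are illustrative):

```python
# Hard vs. soft clustering on the same data (models and parameters are illustrative).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Hard clustering: each point gets exactly one cluster index.
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("hard assignment of first point:", hard[0])

# Soft clustering: each point gets a probability for every cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("soft assignment of first point:", gmm.predict_proba(X[:1]).round(3))
```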

The other main criterion to classify clustering algorithms differentiates partitioning clustering from hierarchical clustering.

  • Partitioning clustering, aka flat clustering, outputs a division of the input dataset into partitions, so no cluster is a subset of another, nor does it intersect any other cluster.
  • Hierarchical clustering produces a hierarchy of clusters that can then be interpreted and “sliced” at any given point, depending on parameters set for the algorithm (see the sketch after this list).
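
To illustrate what “slicing” the hierarchy means, here is a small sketch using SciPy’s agglomerative hierarchical clustering; the linkage method and the two cut levels are arbitrary assumptions:

```python
# Hierarchical clustering builds one tree (a dendrogram) that can be cut at
# different levels to obtain flat clusterings of different granularity.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

tree = linkage(X, method="ward")  # the full hierarchy over all points

coarse = fcluster(tree, t=2, criterion="maxclust")  # slice into (at most) 2 clusters
fine = fcluster(tree, t=6, criterion="maxclust")    # slice into (at most) 6 clusters

print("clusters in the coarse slice:", len(set(coarse)))
print("clusters in the fine slice:", len(set(fine)))
```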

Obviously these two criteria are orthogonal; for instance, you can have hard partitioning clustering algorithms (such as k-means) or soft hierarchical ones (such as OPTICS).

Other criteria often used to classify these algorithms are centroid-based versus density-based, and randomized versus deterministic.

In the next section, we’ll first describe an algorithm for partitioning clustering: k-means, the progenitor of all clustering algorithms. Then we’ll move to a different type of flat clustering, DBSCAN, and finally we’ll pick up the discussion on hierarchical clustering and introduce OPTICS, which is also density-based. Table 12.1 summarizes the “identity card” for the three algorithms presented in this chapter.

Don’t worry; in the next sections, we will explain in detail what each of these properties means and provide examples to make the distinctions clearer.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
