Parallel clustering: Parallelization

Although the RAM model (presented in appendix B) is traditionally single-threaded and algorithm analysis usually focuses on sequential execution and improving the running time of single-process applications, parallelization, when applicable, can allow for tremendous speed-ups, and it should be in the tool belt of every software engineer.

Multi-threading in coding interviews

When it comes to algorithm analysis in coding interviews, you might find that there are diverging opinions on this point. As a personal anecdote, during a round of interviews, I found two interviewers with opposite positions, one considering parallelization “cheating” to solve the problem we were discussing, and another expecting the interviewee to suggest parallelization to solve it. Keep this in mind during your next interview. Of course, it also depends on the specific problems and on where the interviewer wants to lead you, but asking the interviewer about your options for multi-threading and parallelization is often a good idea (provided you know what you are talking about!).

To give you an idea of the kind of speed-up we are talking about, I saw an application’s running time go down from 2 hours to less than 5 minutes by leveraging Kubernetes and Airflow to distribute data download and processing into small chunks, instead of processing the same data sequentially. Of course, splitting data and processing each chunk separately is not always possible; it depends on the domain and on the algorithm.

So, this is the point where we ask ourselves: is clustering a domain where we can parallelize execution and get away with it?

Can we break down our datasets and apply the algorithms we discussed in chapter 12 to each partition independently?

1. Parallel vs distributed

Before we get to the point, a disclaimer is due: usually, the term parallel computing only refers to computations that run on multiple CPUs on the same system (multi-threading, in short). When we think about using multiple CPUs across several machines communicating through a network, then we are instead referring to what’s called distributed computing. Figure 13.1 shows a diagram that illustrates this difference.

Parallel computing is limited by the number of CPUs on a single machine, while distributed computing is a better approach for scaling out systems and processing huge datasets. On the other hand, if a dataset can fit into a single machine’s memory, the parallel computing approach is considerably faster, since the processes can communicate through shared memory, while nodes in distributed systems need to exchange information through a network (at the time of writing, the latencies are roughly 100 ns vs 150 ms, a factor of about 10^6).

In this chapter, we’ll often use the term parallel computer to refer to both. The computational models we present are software abstractions that could run seamlessly on threads on a single machine or on a distributed system, and the only discriminant would be the size of the input and the resources needed, not the algorithms we use.

2. Parallelizing k-means

Let’s now get more specific in order to answer this question: Can we make k-means a parallel algorithm?

Looking at each step of the algorithm separately will help us “divide and conquer” the problem. Please refer to section 12.2 for the description and implementation of k-means.

The first step is initialization, creating an initial guess for the centroids. If this is done completely at random, this step is independent of the dataset and its running time is only proportional to the number k of clusters; therefore, the fully randomized version is not worth parallelizing. Parallelization could be necessary when the initial centroids are instead drawn from the points of the dataset, independently and without replacement. We’ll see how to distribute this step in section 13.3.2.

Step 3, re-centering, computes the center of mass for each cluster. We will tackle this first because each cluster is processed independently and computing the center of mass for a cluster only needs the points in it. We can definitely parallelize this step, with one process for each cluster, so that the total execution time will be that of the longest-running thread. If the sequential version needs n*d sums and k*d divisions, where n is the number of points in the dataset and d its dimension (the cardinality of each point), then assuming a uniform split across clusters (best-case scenario, of course), each process will perform d*n/k additions and d divisions. If all threads finished at the same time and ran at the same speed as the original sequential algorithm, we would obtain a k-fold speed-up.
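
To make this concrete, here is a minimal Python sketch (not one of the book’s listings) of the parallel re-centering step. It assumes points is an (n, d) NumPy array and clusters is a list of k lists of point indices; the names recenter_one and parallel_recenter are made up for this illustration.

# Minimal sketch of parallel re-centering (k-means step 3).
# Assumptions: points is an (n, d) NumPy array, clusters is a list of k
# lists of point indices. When run as a script, the call to
# parallel_recenter should sit under an `if __name__ == "__main__":` guard.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def recenter_one(args):
    points, indices = args
    # Center of mass of one cluster: about d*(n/k) additions and d divisions.
    return points[indices].mean(axis=0)

def parallel_recenter(points, clusters):
    # One task per cluster; the total time is that of the slowest task.
    # (For simplicity, the whole points array is shipped to every worker.)
    with ProcessPoolExecutor() as pool:
        return list(pool.map(recenter_one, [(points, idx) for idx in clusters]))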

Step 2, classification, is more complicated to parallelize. In theory, we would need to check all points’ distances in order to assign them to the right centroid. Thinking about this more carefully, though, do we really need all points? If you refer to figure 12.3 in chapter 12, it seems apparent that a point will only switch to a cluster adjacent to its current assignment, and never to one far away. Also, if a centroid o’ moved further away from a second cluster C (assuming the centroid of the second cluster didn’t move), it would be impossible for a point in C to be assigned to o’. We would need to be very careful, though, with the assumptions we made, so this step would be somewhat more complicated to parallelize.
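
Even without the adjacency-based shortcut discussed above, the classification step can be parallelized in a brute-force, data-parallel way by splitting the dataset into chunks and letting each worker assign its chunk of points to the nearest centroid. The following is only an illustrative sketch; assign_chunk and parallel_classify are invented names.

# Minimal sketch of data-parallel classification (k-means step 2).
# Assumptions: points is (n, d), centroids is (k, d), both NumPy arrays.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def assign_chunk(args):
    chunk, centroids = args
    # Squared Euclidean distance from every point of the chunk to every centroid.
    dists = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)   # index of the nearest centroid for each point

def parallel_classify(points, centroids, n_workers=4):
    chunks = np.array_split(points, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        labels = pool.map(assign_chunk, [(c, centroids) for c in chunks])
    return np.concatenate(list(labels))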

Even by just parallelizing step 3 of the k-means algorithm, we can obtain a nice speed-up with respect to the sequential version.

Can we do better? Yes, we can, in at least two different ways; to see how, we will need first to introduce a new algorithm and then a game-changing programming model.

3. Canopy clustering

What if we could run a quick, coarse-grained pseudo-clustering, before running any real clustering algorithm, to get an idea of the distribution of data?

Canopy clustering is normally used for this purpose. It groups points into spherical regions (circles in our 2-D examples), like k-means, but unlike it, these regions can overlap, and most points are assigned to more than one region.

The canopy clustering algorithm is faster and simpler than k-means, because it runs in a single pass, doesn’t have to compute the centroids for the canopies (spherical pseudo-clusters), and doesn’t compare each point to each centroid; instead, it elects one point in the dataset as the center of each canopy and adds points around it to the canopy.

The algorithm can be made even faster if, instead of the exact distance metric for points in k-dimensional space, a fast approximate metric is used. This gives a less precise result that can be refined by using a proper clustering algorithm as a next step. As we’ll see in the next section, using canopy clustering to bootstrap other algorithms can both speed up convergence and reduce the running time.

Figure 13.2 shows how canopy clustering works through an example, while listing 13.1 describes its pseudo-code.

At a high level, the algorithm can be described by a few simple steps (a short Python sketch follows the list):

  • Select and remove a random point p from the dataset and initialize a new canopy with it (lines #6-7).
  • For each remaining point q, check if the distance between p and q is smaller than a threshold T1 (lines #8-10); if it is, add q to the current canopy (line #11).
  • If said distance is also smaller than a second, tighter threshold T2 < T1, remove q from the list of possible canopy centroids (lines #12-13) so it won’t be the center of a new canopy.
  • Repeat steps 1-3 until no point is left (line #5).
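
Here is a compact Python sketch of these steps. It is only an illustration under stated assumptions (points as an (n, d) NumPy array, thresholds T1 > T2), not a transcription of listing 13.1, and it represents each canopy simply as a (center index, member indices) pair.

# Minimal sketch of canopy clustering. Assumes points is an (n, d)
# NumPy array and T1 > T2; canopies are (center index, member indices) pairs.
import numpy as np

def canopy_clustering(points, T1, T2, seed=None):
    rng = np.random.default_rng(seed)
    remaining = list(range(len(points)))          # points still in the dataset
    canopies = []
    while remaining:                              # step 4: until no point is left
        center = remaining.pop(rng.integers(len(remaining)))           # step 1
        dists = {q: np.linalg.norm(points[q] - points[center]) for q in remaining}
        members = [center] + [q for q, d in dists.items() if d < T1]   # step 2
        canopies.append((center, members))
        # step 3: points within the tight radius T2 can't seed new canopies
        remaining = [q for q in remaining if dists[q] >= T2]
    return canopies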

This process produces spherical agglomerates with radius (at most) T1; the inner radius T2 identifies the critical distance within which points can be confidently considered to be related to each other (or, equivalently, in the same cluster). While a point q (at steps 2-3) could have already been added to a different canopy, if it’s within the inner radius of the current canopy’s centroid c, we are confident the pair (q, c) is maximally correlated, and even if we chose q as a centroid, we couldn’t form a canopy that better fits q.

If you are wondering why we can rely on these distances and how we can come up with a good value for T1 (and T2), you are on top of the main issue: these parameters usually need to be tuned (trying a few different ones and checking the number/quality of the canopies), and sometimes good initial estimates for these distances come from experience and domain knowledge. For instance, if you were clustering geographical data about cellphone cells and you knew that no two cells are further away than a few kilometers, you would have a hint about T1.

Deciding these values is not as big a deal as it may seem; it is not important to come up with the best possible value, but rather to find an acceptable one for both parameters. As we mentioned, this algorithm only provides a coarse-grained clustering.
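
For instance, a quick (and entirely illustrative) way to explore a few candidate pairs of thresholds with the canopy_clustering sketch above; the dataset points and the threshold values here are assumptions made up for the example.

# Illustrative only: compare a few (T1, T2) pairs by the number and size
# of the canopies they produce. The threshold values here are made up.
for T1, T2 in [(10.0, 4.0), (20.0, 8.0), (50.0, 20.0)]:
    canopies = canopy_clustering(points, T1, T2, seed=42)
    sizes = [len(members) for _, members in canopies]
    print(f"T1={T1}, T2={T2}: {len(canopies)} canopies, largest has {max(sizes)} points")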

4. Applying canopy clustering

Canopy clustering is often used as a pre-processing step for k-means, but it can also be used for DBSCAN and OPTICS. As a matter of fact, for those algorithms it has one further advantage. But we’ll get to that in a moment. First, let’s talk about how to combine canopy clustering and k-means.

The easiest way is quite intuitive. We can take the coarse-grained clusters (with overlap) output by canopy clustering, and for each of them compute their center of mass. Because these clusters can overlap each other, some points will belong to more than one of them; therefore, we can’t just treat these canopies as the result of an iteration of k-means! However, we can use their centroids to bootstrap k-means, replacing the default (random) initialization step with a more balanced choice.
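
A possible sketch of this bootstrapping, assuming scikit-learn is available, that canopies are the (center, members) pairs produced by the earlier sketch, and that at least k canopies were found (here we simply keep the k most populated ones):

# Minimal sketch: seed k-means with the canopies' centers of mass.
# Assumes scikit-learn and at least k canopies (the k largest are kept).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_from_canopies(points, canopies, k):
    largest = sorted(canopies, key=lambda c: len(c[1]), reverse=True)[:k]
    # Centers of mass of the (possibly overlapping) canopies, used only as seeds.
    seeds = np.array([points[members].mean(axis=0) for _, members in largest])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(points)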

Alternatively, we could run canopy clustering with coarse values of T1 and T2, and improve the initialization step by making sure to draw a fraction of the initial centroids from each of these areas. If the canopy clustering returned m ≤ k pseudo-clusters, draw roughly k/m centroids from each of them. In practical experiments, this bootstrapping provided a noticeable speed-up in convergence for k-means.
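
A sketch of this alternative initialization follows; the rounding policy for k/m and the function name init_from_canopies are arbitrary choices made for the example.

# Minimal sketch: with m <= k canopies, draw roughly k/m initial
# centroids from the points of each canopy.
import numpy as np

def init_from_canopies(points, canopies, k, seed=None):
    rng = np.random.default_rng(seed)
    m = len(canopies)
    seeds = []
    for i, (_, members) in enumerate(canopies):
        # Spread the k seeds as evenly as possible over the m canopies.
        quota = k // m + (1 if i < k % m else 0)
        chosen = rng.choice(members, size=min(quota, len(members)), replace=False)
        seeds.extend(points[chosen])
    return np.array(seeds)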

If you remember what we discussed in chapter 12, DBSCAN (section 12.3) has a weak spot when applied to datasets with non-uniform density, and OPTICS (section 12.4) can partially remedy this issue, at the cost of a heavier computational load and some experimenting with its parameters. Ideally, what we would need is to run DBSCAN independently on areas with different density and tune its parameters (or just the value of ε) for each of these areas separately.

Using canopy clustering as a first step can help us with this issue. If we run DBSCAN on each pseudo-cluster separately, we can expect smaller regions to have a more uniform density, and regions with different density to be—likely—assigned to different pre-clusters.
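
As an illustration, assuming scikit-learn’s DBSCAN and an eps_for lookup (a made-up callable returning the ε tuned for a given canopy), each canopy can be clustered independently, and each call could be dispatched to its own worker:

# Minimal sketch: run DBSCAN separately on each canopy, with a per-region eps.
# eps_for is an assumed callable mapping a canopy center to its tuned eps.
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_per_canopy(points, canopies, eps_for, min_samples=5):
    local_results = []
    for center, members in canopies:
        members = np.asarray(members)
        model = DBSCAN(eps=eps_for(center), min_samples=min_samples)
        labels = model.fit_predict(points[members])
        # Keep the global point indices alongside their local cluster labels.
        local_results.append((members, labels))
    return local_results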

After computing all the clusters for these areas, however, we still aren’t done. Because the pseudo-clusters were (possibly) overlapping, the local clusters might also overlap. Besides checking to see if we should merge clusters that overlap, there is a more subtle effect: as shown in figure 13.3, two non-overlapping clusters could have points that are in each other’s ε-neighborhood! So, we need to also check those pseudo-clusters whose hyperspheres are closer to each other than the larger of the values of ε used for those areas (in case DBSCAN was called on them with different values for its hyper-parameters).

The good news is that for such canopies, we don’t have to check every combination of points drawn from their Cartesian product,[5] but only the points in the external ring of each of them, at a distance from the canopy’s center greater than or equal to T1 - ε.
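
A sketch of this filtering, under the reading above (only points farther than T1 - ε from their canopy’s center need to be compared across two nearby canopies); the name boundary_pairs is made up for the example.

# Minimal sketch: cross-check only the external rings of two nearby canopies.
# A canopy is a (center index, member indices) pair, as in the earlier sketch.
import numpy as np

def boundary_pairs(points, canopy_a, canopy_b, T1, eps):
    def external_ring(center, members):
        d = np.linalg.norm(points[members] - points[center], axis=1)
        return np.asarray(members)[d >= T1 - eps]   # points close to the border

    ring_a, ring_b = external_ring(*canopy_a), external_ring(*canopy_b)
    # Only the (filtered) Cartesian product of the two rings is checked.
    return [(int(p), int(q)) for p in ring_a for q in ring_b
            if np.linalg.norm(points[p] - points[q]) <= eps]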

Now the issue is that we can parallelize the execution of DBSCAN on each pseudo-cluster, but to put together all the results, we need to check all the pairs of clusters produced by each parallel run, and all the (filtered) pairs of points in the Cartesian product between the external rings of these clusters. Do we need to run these checks in a single thread on a single machine, hence going back to sequential execution?

Source: Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications, 2021.
