Parallel clustering: MapReduce

For a long time, engineers struggled to efficiently parallelize the execution of algorithms like DBSCAN, at least from a practical point of view, because of both hardware limits and a lack of software infrastructure. The most widely used distributed programming model was grid computing, until it was realized that a different approach would make computation not just faster, but potentially more robust. That’s when the MapReduce programming model was adopted on a large scale, after being patented and used by Google in the early 2000s.

Although there are several implementations of MapReduce (or, we should say, several products leveraging MapReduce to provide tools that orchestrate distributed resources to solve tasks, such as Apache Hadoop, Hive, or CloudDB), I believe its main value is in the model it provides, which can be applied to a plethora of tasks.

For this reason, we’ll try to explain how it works through an example.

1. Imagine you are Donald . . .

Imagine Donald Duck dozing on his hammock—as always—on a lazy afternoon, when suddenly a phone ringing louder than normal (and more annoyingly than normal!) wakes him up. He knows before even answering that he is being gently summoned by his lovely uncle Scrooge McDuck, and he needs to rush to the Money Bin—a kind request to which he gladly responds, in order to avoid being disowned (and overwhelmed by debt).

Long story short: as always, his good old Uncle Scrooge has a long, boring task for Donald to attend to. This time, since he is securing the Money Bin’s main room and has to move all the coins to a different, giant safe, he wants to take advantage of the situation (surprise, surprise!) and count and catalog all the coins in his Money Bin . . . by the next morning.

We are talking about millions of coins, so it would be humanly impossible to do this on one’s own. When Donald Duck regains his senses (he understandably fainted when Uncle Scrooge broke the news), he figures out that he’ll need all the help he can get, so he runs to Gyro Gearloose’s and convinces him to create a horde of robo-clones that will be able to learn how to recognize different coins and catalog them.

This step is the “classical” parallelization step: you break the work down (into piles of coins), hand each pile to a copy of your software (the counting/cataloging routine), and write down, for each pile, which coins were found and how many of them there are. For instance, a machine could produce a list like the following (a short code sketch of this step follows the list):

£1: 1034 pieces
50¢: 53982 pieces
20p: 679 pieces
$1: 11823 pieces
1¢: 321 pieces
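A minimal sketch of this flat parallelization, in Python, might look like the following. The pile data, coin labels, and helper names are invented for illustration; the point is that the piles are counted concurrently, but a single coordinator still merges every partial list.

from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_pile(pile):
    # Each robo-clone: recognize and tally the coins in its own pile.
    return Counter(pile)              # e.g. Counter({'50¢': 53982, '£1': 1034, ...})

def flat_parallel_count(piles):
    # "Classical" parallelism: the piles are counted concurrently...
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(count_pile, piles))
    # ...but a single entity (poor Donald) still merges every list, entry by entry.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

if __name__ == "__main__":
    piles = [["£1", "50¢", "50¢", "$1"], ["20p", "1¢", "$1"]]   # toy piles
    print(flat_parallel_count(piles))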

So, problem solved? Well . . . not really. Robo-clones are expensive and take time to build, so even a genius like Gyro could only provide a hundred of them, by quickly rewiring some not-so-well-behaved robo-waiters he created in one of his experiments. They have become quite fast at counting money, but each of them has a huge pile of coins, resulting in a long list of coin types with their quantities. Figure 13.4 illustrates the situation: once they’re finished, it’s up to Donald to add up the hundreds of entries in all those hundreds of lists.

After fainting again and being woken up with Uncle Scrooge’s ammonia (he’ll be charged for it, it goes without saying), good old Donald crawls to Gyro’s lab with a desperate cry for more help.

Unfortunately, Gyro can’t afford to build more counting machines! But he wouldn’t be a genius if he couldn’t solve this problem.

And to do so, he won’t have to build anything; just getting some help and using a different algorithm will do. After doing a quick computation in his head, he estimates that there are about two hundred different types of coins. So he rounds up the whole family and gives a task to each of them: they will have to handle five types of coins each, but they won’t have to count them. They will receive a few lists (well, a hundred of them) from the counting machines, but each list will only have five entries, for the same five types of coins, together with how many of them the individual counting machine found.

To achieve this, he provides each counting machine with an address to locate each member of the McDuck family—for instance, an email address such as huey.mcduck@duckmail.com—and a dictionary that is the same for each machine and that lists the types of coins handled by each member of the family. To simplify things, we can imagine that each member is assigned all the coins from a single country. For instance, as shown in figure 13.5, Huey could get all US dollars, Dewey all UK sterling pounds, Louie all Euros, and so on. But in a real application, each of them could get any combination of coin denominations.
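A hedged sketch of that routing step follows; the email addresses, the coin-to-reducer assignments, and the function name are assumptions made up for this example.

# The same dictionary is shipped to every counting machine: it tells the machine
# which family member (reducer) is responsible for each type of coin.
REDUCER_FOR_COIN = {
    "$1": "huey.mcduck@duckmail.com",    # Huey handles US dollars...
    "25¢": "huey.mcduck@duckmail.com",
    "£1": "dewey.mcduck@duckmail.com",   # ...Dewey handles UK sterling...
    "20p": "dewey.mcduck@duckmail.com",
    "€1": "louie.mcduck@duckmail.com",   # ...Louie handles Euros, and so on.
}

def route_counts(machine_counts):
    # Group one machine's tallies by the reducer that must receive them.
    outgoing = {}                        # reducer address -> {coin: count}
    for coin, count in machine_counts.items():
        reducer = REDUCER_FOR_COIN[coin]
        outgoing.setdefault(reducer, {})[coin] = count
    return outgoing                      # one short "email" per family member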

Figure 13.4 A first attempt to parallelize coin counting. In this configuration, poor Donald still has to sum hundreds of lists with hundreds of entries each

Once a machine is done counting, it goes through the members of the family and sends them each an email with the total number of coins found for each of the types they are responsible for. Then each family member will have to sum the values in each list, for each type of coin, and send the final result to Uncle Scrooge—just a few hundred integer additions per duck; a tedious job, maybe, but one that shouldn’t take too long (just make sure not to let Fethry anywhere near a computer!).

For instance, if Daisy Duck is assigned the $1, 50¢, 25¢, and 1¢ coins, then all the machines will send her a short list that looks like this (a sketch of how she sums these lists follows the example):

25¢: 1034 pieces
$1: 11823 pieces
50¢: 53982 pieces
1¢: 321 pieces
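What each family member then does is a straightforward per-key merge. Here is a sketch, with invented numbers, of Daisy’s side of the work:

from collections import Counter

def reduce_coin_counts(short_lists):
    # Daisy's job: sum, coin by coin, the short lists sent by every machine.
    totals = Counter()
    for short_list in short_lists:       # one dict per counting machine
        totals.update(short_list)        # a few hundred integer additions overall
    return dict(totals)

# e.g. two machines reporting on the coins Daisy is responsible for:
lists_for_daisy = [
    {"$1": 11823, "50¢": 53982, "25¢": 1034, "1¢": 321},
    {"$1": 98,    "50¢": 12,    "25¢": 7,    "1¢": 4},
]
# reduce_coin_counts(lists_for_daisy)
#   -> {'$1': 11921, '50¢': 53994, '25¢': 1041, '1¢': 325}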

Figure 13.5 shows the shift in paradigm. Before, the person who had to make sense of all the lists was the bottleneck of the computation; now, what makes the difference is introducing a new intermediate level in the workflow and breaking down the work so that each entity at this level has only a limited amount of work to do.

The key, though, is that the results output at level 1 can be partitioned into groups, and each of these groups can then be handled separately at level 2.

2. First map, then reduce

Time to abandon our cartoon heroes for a more lifelike example, where both levels of this parallel computation would be performed by machines. The operation at level 1 is called Map, because it maps each entry in the input dataset (more precisely, in the portion of the dataset handled by a machine) into something else, extracting the information that’s relevant for computing the final result. The mappers in our example could likely just run “coin-recognition” software, without keeping a count,[1] and send lists containing unsorted occurrences of coins to the machines at level 2. Something like this:

$100: 1
50¢: 1
$100: 1
$1: 1
25¢: 1

Here, the info extracted by mappers is just the presence of a coin.

Then the machines at level 2 would specialize in counting. Every machine at level 2 would receive all the entries for occurrences of a certain group of coins, and do something with them (counting them, for example, but it could also sum up their values or filter them). This step is therefore called Reduce, because it takes info limited to a homogeneous group of entries and combines (aka reduces) them to get our final result.
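Putting the two levels together, here is a compact, single-process sketch of the map → shuffle → reduce flow; in a real deployment each phase would run on many machines, and the shuffle would be the mail/file routing described above. The chunk contents and function names are invented for illustration.

from collections import defaultdict

def map_phase(chunk):
    # Mapper: recognize each coin and emit a (coin, 1) pair, without counting.
    for coin in chunk:
        yield (coin, 1)

def shuffle(pairs):
    # Group the intermediate pairs by key, as if routing them to the reducers.
    groups = defaultdict(list)
    for coin, one in pairs:
        groups[coin].append(one)
    return groups

def reduce_phase(coin, ones):
    # Reducer: combine all occurrences of one coin type into a single count.
    return coin, sum(ones)

chunks = [["$1", "50¢", "$1"], ["25¢", "$1", "50¢"]]           # toy input split
pairs = (pair for chunk in chunks for pair in map_phase(chunk))
totals = dict(reduce_phase(coin, ones) for coin, ones in shuffle(pairs).items())
# totals == {'$1': 3, '50¢': 2, '25¢': 1}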

As mentioned, the key disadvantage of the classic, “flat” parallel computation is that composing the results of all the parallel threads/processes/machines would be the bottleneck of the whole process. If a single process has to spawn the threads and then get their results and combine them, it will still have to sequentially access the entire dataset at least once, and even if it also parallelizes the process that combines the intermediate results, it still remains a bottleneck, as shown in figure 13.6. On the left half, you can see that for basic parallelism, the “intermediate output” is all sent to the orchestrator that has to gather it and sort it to the machines in layer 2.

In MapReduce, however, every step is intrinsically parallel. Data is already broken down into pieces that can be processed independently and results are routed to the reducers by each mapper on its own, without passing through a central orchestrator.

Technically, the reducers are the ones that read the information from each mapper, while the mappers’ task is to create temporary files for each reducer in a specific location (different for each reducer—imagine, for instance, that each mapper creates a different folder or a dedicated virtual disk for each reducer).
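Here is a minimal sketch of how a mapper might lay out those per-reducer temporary files; the file naming scheme and the use of a hash to pick the reducer are assumptions for illustration, not the exact mechanism of any specific framework.

import json
import os
from zlib import crc32

def write_partitions(mapper_id, pairs, num_reducers, out_dir="intermediate"):
    # Write each (coin, count) pair into the temporary file of the reducer
    # responsible for that coin, chosen with a stable hash of the key.
    os.makedirs(out_dir, exist_ok=True)
    files = [open(os.path.join(out_dir, f"map{mapper_id}-reduce{r}.jsonl"), "w")
             for r in range(num_reducers)]
    try:
        for coin, count in pairs:
            r = crc32(coin.encode("utf-8")) % num_reducers   # stable partitioning
            files[r].write(json.dumps([coin, count]) + "\n")
    finally:
        for f in files:
            f.close()
    # Later, reducer r reads every map*-reduce{r}.jsonl file on its own:
    # no central orchestrator ever touches the intermediate data.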

Besides speed, the MapReduce approach has another advantage: if a machine crashes, that single machine can be replaced without having to restart the whole computation. This, in turn, can help us increase availability and reduce latency by allocating redundant resources to preventively cope with malfunctions.

One thing needs to be clear. In MapReduce there is also an orchestrator, a leader node that controls the computation (spinning up the computational nodes, or requesting existing resources, assigning the input chunks to the mappers, planning how to route intermediate results to reducers, and handling/recovering from errors). The difference from canonical parallelism, though, is that this special node doesn’t actually read or process the input, and that intermediate results don’t have to pass through it, so it’s not a bottleneck for computation. An objection could be that the leader node is still a bottleneck for availability, because if it crashes, the computation can’t be completed; however, using replicas (either live copies or primary-replica), we could get availability guarantees through (limited) redundancy.

There are catches, of course. The first one is that not all computations are suitable for the MapReduce model (or for parallel execution altogether). In general, if data entries are somehow connected, and scattered pieces of data influence each other’s contribution to the final result, then parallelization could be impossible: time series are a good example of data that normally needs to be processed sequentially, because the final result depends on the sequence of adjacent data.

For MapReduce, the requirements are even higher. In order to gain an advantage from applying it, we need data that can be grouped by some attributes/fields, and that can be reduced for each of these groups separately.

The operation performed in the reducers, moreover, must be associative, so that the order in which the intermediate sub-lists are output by the mappers does not matter.
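For example, with addition as the reduce operation, the partial counts for a coin type can be combined in whatever order they happen to arrive (strictly speaking, reordering also relies on the operation being commutative). The numbers below are invented for illustration.

from functools import reduce

partials = [1034, 98, 7, 11823]              # counts of one coin type from four mappers
in_order = reduce(lambda a, b: a + b, partials)
reversed_order = reduce(lambda a, b: a + b, reversed(partials))
assert in_order == reversed_order == 12962   # the order of arrival doesn't matter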

It’s worth noting that if, instead of cataloging all the coins, we just wanted to count how many of them there are (without distinguishing their type), or to compute their total value, we wouldn’t need reducers. Each parallel process would just output its total, and then a single central process could add those totals up.
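A quick sketch of that simpler, reducer-free variant: every worker emits a single number, so the one central sum is no longer a bottleneck (chunk contents and function names are invented for the example).

from concurrent.futures import ProcessPoolExecutor

def chunk_total(chunk):
    # Each worker outputs just one number: how many coins it saw in its chunk.
    return len(chunk)

def count_all_coins(chunks):
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(chunk_total, chunks))   # one integer per chunk to add up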

The second catch is that there is no centralized entity that splits the work and distributes it evenly to the reducers, so one reducer can get very busy while another waits without anything to do. Going back to our story, for instance, while Daisy Duck will have to worry about US currency, Gladstone Gander is assigned all the rare coins from small countries (lucky him), and thus the lists he gets are almost all empty, and he has to perform just a few additions.

3. There is more under the hood

We have seen a few advantages of the MapReduce model, but there is more to its success that can’t be seen at a high level. In chapter 7, when talking about caches and multi-threading, we discussed locks and synchronization. Every parallel computation with a shared state will need synchronization, be it to aggregate results, to break down data and assign it to computing units (threads or machines), or just to check that the processing is complete.

The key advantage of MapReduce is that it intrinsically limits shared state to a minimum (by embracing functional programming concepts such as immutability and pure functions). It provides a programming paradigm, a way to specify the problem, that forces us to state it in a form that eliminates shared state, while handling whatever synchronization is still needed under the hood.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
