Parallel Streams in Java

Streams make it easy to parallelize bulk operations. The process is mostly automatic, but you need to follow a few rules. First of all, you must have a parallel stream. You can get a parallel stream from any collection with the Collection.parallelStream() method:

Stream<String> parallelWords = words.parallelStream();

Moreover, the parallel method converts any sequential stream into a parallel one.

Stream<String> parallelWords = Stream.of(wordArray).parallel();

As long as the stream is in parallel mode when the terminal method executes, all intermediate stream operations will be parallelized.
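As a quick sketch (with made-up sample data), calling parallel() anywhere before the terminal operation puts the entire pipeline into parallel mode:

```java
import java.util.List;

public class ParallelModeDemo {
    public static void main(String[] args) {
        List<String> words = List.of("alpha", "beta", "gamma", "delta");
        // parallel() may be called at any point before the terminal operation;
        // when count() executes, all intermediate operations run in parallel.
        long count = words.stream()
            .parallel()
            .map(String::toUpperCase)
            .filter(s -> s.length() > 4)
            .count();
        System.out.println(count); // 3 ("ALPHA", "GAMMA", "DELTA")
    }
}
```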

When stream operations run in parallel, the intent is that the same result is returned as if they had run serially. It is important that the operations are stateless and can be executed in an arbitrary order.

Here is an example of something you cannot do. Suppose you want to count all short words in a stream of strings:

var shortWords = new int[12];
words.parallelStream().forEach(
    s -> { if (s.length() < 12) shortWords[s.length()]++; });
    // ERROR: race condition!
System.out.println(Arrays.toString(shortWords));

This is very, very bad code. The function passed to forEach runs concurrently in multiple threads, each updating a shared array. As you saw in Chapter 12 of Volume I, that’s a classic race condition. If you run this program multiple times, you are quite likely to get a different sequence of counts in each run—each of them wrong.

It is your responsibility to ensure that any functions you pass to parallel stream operations are safe to execute in parallel. The best way to do that is to stay away from mutable state. In this example, you can safely parallelize the computation if you group strings by length and count them:

Map<Integer, Long> shortWordCounts
    = words.parallelStream()
        .filter(s -> s.length() < 12)
        .collect(groupingBy(
            String::length,
            counting()));
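For context, a self-contained version of this pipeline might look like the following (the word list here is made-up sample data; groupingBy and counting are statically imported from java.util.stream.Collectors):

```java
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

public class ShortWordCounts {
    public static void main(String[] args) {
        List<String> words = List.of("stream", "parallel", "java", "lambda", "collector");
        // Each thread contributes to per-length groups; no shared mutable
        // state is touched by the lambdas, so parallel execution is safe.
        Map<Integer, Long> shortWordCounts = words.parallelStream()
            .filter(s -> s.length() < 12)
            .collect(groupingBy(String::length, counting()));
        System.out.println(shortWordCounts); // e.g. {4=1, 6=2, 8=1, 9=1}
    }
}
```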

By default, streams that arise from ordered collections (arrays and lists), from ranges, generators, and iterators, or from calling Stream.sorted, are ordered. Results are accumulated in the order of the original elements and are entirely predictable. If you run the same operations twice, you will get exactly the same results.

Ordering does not preclude efficient parallelization. For example, when computing stream.map(fun), the stream can be partitioned into n segments, each of which is concurrently processed. Then the results are reassembled in order.
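For instance, mapping over an ordered parallel stream still yields results in encounter order. Here is a small sketch using a numeric range:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class OrderedParallelMap {
    public static void main(String[] args) {
        // The range is split into segments that are mapped concurrently,
        // but the results are reassembled in the original order.
        List<Integer> squares = IntStream.rangeClosed(1, 10)
            .parallel()
            .map(n -> n * n)
            .boxed()
            .collect(Collectors.toList());
        System.out.println(squares); // [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    }
}
```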

Some operations can be more effectively parallelized when the ordering requirement is dropped. By calling the Stream.unordered method, you indicate that you are not interested in ordering. One operation that can benefit from this is Stream.distinct. On an ordered stream, distinct retains the first of all equal elements. That impedes parallelization: the thread processing a segment can't know which elements to discard until the preceding segment has been processed. If it is acceptable to retain any of the unique elements, all segments can be processed concurrently (using a shared set to track duplicates).
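Here is a minimal sketch, with made-up sample data, of dropping ordering before distinct. For integers the retained values are the same either way; the point is that unordered() frees the implementation from keeping the first occurrence of each duplicate:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class UnorderedDistinctDemo {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 1, 3, 2, 4, 3, 5);
        // With unordered(), distinct() may keep any one of each set of
        // equal elements, so segments can be processed independently.
        Set<Integer> unique = values.parallelStream()
            .unordered()
            .distinct()
            .collect(Collectors.toSet());
        System.out.println(unique.size()); // 5
    }
}
```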

You can also speed up the limit method by dropping ordering. If you just want any n elements from a stream and you don’t care which ones you get, call

Stream<String> sample = words.parallelStream().unordered().limit(n);

As discussed in Section 1.9, “Collecting into Maps,” on p. 30, merging maps is expensive. For that reason, the Collectors.groupingByConcurrent method uses a shared concurrent map. To benefit from parallelism, the order of the map values will not be the same as the stream order.

Map<Integer, List<String>> result = words.parallelStream().collect(
    Collectors.groupingByConcurrent(String::length));

// Values aren’t collected in stream order

Of course, you won’t care if you use a downstream collector that is independent of the ordering, such as

Map<Integer, Long> wordCounts
    = words.parallelStream()
        .collect(
            groupingByConcurrent(
                String::length,
                counting()));

Don’t turn all your streams into parallel streams in the hope of speeding up operations. Keep these issues in mind:

  • There is a substantial overhead to parallelization that will only pay off for very large data sets.
  • Parallelizing a stream is only a win if the underlying data source can be effectively split into multiple parts.
  • The thread pool that is used by parallel streams can be starved by blocking operations such as file I/O or network access.

Parallel streams work best with huge in-memory collections of data and computationally intensive processing.
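To see how many worker threads back your parallel streams, you can inspect the common fork-join pool. This is a small sketch; the exact numbers depend on your machine:

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {
    public static void main(String[] args) {
        // Parallel streams run on the common fork-join pool, whose default
        // parallelism is typically one less than the processor count.
        // A blocking task ties up one of these workers for its duration.
        System.out.println("Pool parallelism: "
            + ForkJoinPool.commonPool().getParallelism());
        System.out.println("Available processors: "
            + Runtime.getRuntime().availableProcessors());
    }
}
```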

The example program in Listing 1.8 demonstrates how to work with parallel streams.

Source: Horstmann Cay S. (2019), Core Java. Volume II – Advanced Features, Pearson; 11th edition.
