Daniel Łaś Tech Blog - Performance of parallel Streams. The truth revealed.

This is continuation of my previous post Speed of stream where I compared benchmarks results of various applications of Java Stream API . Please take a look there if you want to know how fast (or slow) Streams are. I ended it with the question whether using parallel streams in a typical multiuser application makes any sense. The common wisdom is that we should use parallel streams in such circumstances with care. Here below you can find results of several benchmarks showing hard evidences why this caution is legitimate.

What's the point?

We already know that Java Streams sometimes are slow and sometimes perform quite well. You can speed up your calculations using parallel streams. Not every type of calculation deserves spending additional time and resources on threads coordination. There is another factor that must be considered in regards of parallel streams usage - every parallel stream started in the JVM uses the common thread pool. If your program is executed as a single process at a time and your operating system hasn't much more to do - go ahead and don't hesitate to use parallel streams.

The situation differs vastly when your parallel stream is not the only one running at the same time.

It's not difficult to find such use cases, let's take web applications for an example ... The problem is the same as always when multiple users compete for shared constrained resource. Some of the competitors have to wait in the queue and the queue management costs more when it's becoming larger.

What and how was tested

I tested three applications of Stream API in sequential and parallel variant:

Sum of calculations
Grouping
Filtering and and sorting distinct items

All of the applications will be described in details later.

Just like previously, I used JMH to run benchmarks and measure the performance. The difference is that tested code was executed in 2, 4, 6, 8, 10,12, 14 and 16 threads at the same time on 8 core machine. Every benchmark was executed against stream of size 1000 and 100 000 to check if the size matters.

See the benchmark runner configuration

The JVM arguments passed in every test were: -Xms2G -Xmx2G -XX:+UseG1GC. The JVM version used was: Java HotSpot(TM) 64-Bit Server 1.8.0_201.

Benchmark results are presented as charts showing number of operations per second of sequential and parallel streams running in 2, 4, 6, 8, 10,12, 14 and 16 threads. The number of operations is the sum of operations executed in every running thread within one second.

If you wonder if this setup reflects the reality of web applications ... this is how modern Non-Blocking I/O based web frameworks work: there is some thread responsible for IO and passing tasks to multiple worker threads waiting in the pool.

Sum of calculations

Here the task is to perform some math operation on list of randomly generated floating-point numbers and sum all calculated values:

Parallel variant differs in the way the stream is created in line 2 - stream() vs. parallelStream():

Please note that I deliberately used reduce operation in line 4. Reduction should parallelize more gracefully because numbers addition and zero value form Monoid. Thanks to this, no synchronization is needed to sum operations results calculated in different threads. If I'd have used collect, partial results would have been added to a single variable that would have to be synchronized.

Below chart show benchmarks results for 1000 long stream.

There are no surprises for sequential stream. Total number of operations grows together with number of cores involved until all available cores are in use. We gain nothing when the number of threads increases above the number of available cores. This is how it is. This is only the simulation of multitasking provided by the operating system called context switching when the number of active tasks exceeds the number of processing units.

Making thread pools size larger will not make your multiuser program more efficient.

Parallel stream behavior looks more interesting - total number of operations seems to be constant regardless of the count of parallel threads. It's not exactly constant but differences are so small that it's safe to assume that they are insignificant. This is happening because it's always common thread pool fully utilized used regardless of number of concurrent users and no additional synchronization is required.

Parallel version was the clear winner up to 4 concurrent threads while 8 cores were available. Starting from 6 threads sequential variant was doing as well or better then parallel one. Simply speaking, coordination costs of concurrent access to shared resource prevails at some point. Sequential streams executed in completely separated threads with no contention point suck all power from the processor.

Results are a little bit different for 100 000 long stream:

Did you notice that parallel stream performed slightly better then sequential one when the number of items in the stream were larger? I find it difficult to explain. Such results were rather obvious while comparing stream and for-each speed - there is some cost associated with stream creation not paid when stream is larger. It looks like in this particular case there is some additional cost bound to parallel stream creation or a benefit related to larger streams processing but I'm not sure what it is.

Grouping

Grouping items using streams is very convenient. This time the benchmark takes list of randomly generated Double 's, group by result of division by 100 rounded to long . Produced result is of type Map<Long, List<Double>>. For grouping we usually use collect instead of reduce operation because there are very handy grouping collectors available in the Stream API. Collecting is a mutable reduction what means that in case of parallel stream there must be additional synchronization performed. There is additional version of grouping collector expecting the thread safety guarantee from grouping container. In our case this is ConcurrentHashMap. The consequence is the unordered results list as a map's value and the chance for a better parallelism.

vs.

Results of shorter streams benchmarks are visible below.

This time the situation is way different. Sequential stream is performing better with growing resources up to 8 threads. Parallel stream gained more speed up to 4 threads, later on it was on the same level but still not better then sequential stream. The concurrent collector used with the parallel stream only made the whole operation slower. It's only advantage was increasing performance even when the number of threads went above available cores count, not a big success considering weak overall performance.

Longer stream again produced slightly different results:

Parallel stream was the fastest one up to 4 threads and processing with the same speed regardless of the used threads count. The parallel stream with concurrent collector weakness is less visible.

It looks like additional synchronization makes parallel streams less usable in the maximum performance context.

Filtering and sorting distinct items

Finally the next commonly used stream operations: filtering, sorting and distinct selection. This benchmark takes random decimal numbers, filters them by predefined minimal and maximum value, takes distinct values and sorts in natural order. Sorting should not cause much a difference because it's performed after the stream (sequential or parallel) is consumed. Distinct selection requires the stream to maintain internal state what means additional synchronization apart from produced items collection in case of a parallel stream.

These are 1000 long stream results:

and 100 000 long stream results:

We already saw trends presented on the above chart but this time trends are even stronger. Additional contention point makes the parallel stream the looser. It had his time processing larger amount of items in 2 threads. Overall computing power of sequential stream was getting better when executed in parallel with the limit of available cores.

So what are lessons learned? Nothing revealing, just that hard physical limits still apply:

you can't trick your computation power by increasing number of worker threads
shared resource blocking is the first class enemy of parallelism

Thank you dear reader for reading my post to the end. I hope you enjoyed it.