From time to time I witness, or am involved in, discussions about the performance of Java 8 streams. The common wisdom is that streams trade some performance for brevity and convenience. I couldn't find any comprehensive set of benchmarks showing how fast (or slow) streams really are, so I decided to prepare and run a few tests myself. Below I present the gathered results.
What and how was tested
I measured the performance of several typical uses of
forEach, such as applying a function to a list of
numbers and summing the results, or filtering and grouping such a list - usual programming tasks. The same tasks were implemented with different
Stream operations and the results were compared.
The choice of benchmarking tool was easy: JMH, the battle-tested microbenchmark toolkit maintained under the umbrella of OpenJDK. It's very flexible and easy to use.
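For illustration, a minimal JMH benchmark in the spirit of these tests might look like the sketch below. This is not the article's actual benchmark code: the class, method and parameter names are mine, and it assumes the JMH dependency (`org.openjdk.jmh`) is on the classpath and the benchmark is launched through the usual JMH runner.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.*;

// Illustrative sketch only: names and parameters are mine, not the article's.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SumBenchmark {

    // JMH runs the benchmark once per @Param value, matching the list
    // sizes used in the article.
    @Param({"1000", "10000", "100000", "1000000"})
    private int size;

    private List<Integer> numbers;

    @Setup
    public void setup() {
        numbers = IntStream.range(0, size).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public long forEachSum() {
        long sum = 0;
        for (Integer n : numbers) {
            sum += n;
        }
        return sum;
    }

    @Benchmark
    public long streamSum() {
        return numbers.stream().mapToLong(Integer::longValue).sum();
    }
}
```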
Most of the benchmarks were run on lists of different sizes: 1 000, 10 000, 100 000 and 1 000 000 elements, to verify how a given implementation performs depending on the number of elements it has to process.
The JVM arguments passed in every test were: -Xms2G -Xmx2G -XX:+UseG1GC. The JVM used was Java HotSpot(TM) 64-Bit Server VM, version 1.8.0_201.
Benchmark results are presented on charts as percentages of the best result within each sample size. The charts are normalized: for a given list size, the best score among the different implementations is found and every score is divided by it, so the best result is 100 %. A 50 % score means that benchmark was half as fast as the best one.
Results are also shown in tables, in operations per second, with the best results emphasized.
Sum of integers
This is the first benchmark, comparing the performance of summing the integers from a list.
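The exact benchmark code isn't reproduced here, but the kinds of implementations being compared can be sketched as follows. The method names (forEachSum, collectSum, reduceSum) are mine, not the article's; they merely mirror the benchmark labels used below.

```java
import java.util.List;

// A sketch of the compared implementations; names are illustrative.
public class IntSums {

    // Classic loop: the "forEach" style baseline.
    public static long forEachSum(List<Integer> numbers) {
        long sum = 0;
        for (Integer n : numbers) {
            sum += n;
        }
        return sum;
    }

    // Stream-based summing via a primitive stream.
    public static long collectSum(List<Integer> numbers) {
        return numbers.stream().mapToLong(Integer::longValue).sum();
    }

    // Stream reduction with an explicit identity and accumulator.
    public static long reduceSum(List<Integer> numbers) {
        return numbers.stream().reduce(0, Integer::sum).longValue();
    }

    // Parallel variant of the reduction.
    public static long parallelReduceSum(List<Integer> numbers) {
        return numbers.parallelStream().reduce(0, Integer::sum).longValue();
    }
}
```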
And this is the results table.
The outcome was easy to predict. We have a clear winner:
forEach, but its advantage shrinks as the number of items grows,
especially when compared to the parallel version of
collect. Sequential
collect is better on 1 000
elements, but then the parallel version starts to move ahead.
reduce doesn't perform well at all, being the worst in almost all cases, but
again, the parallel stream with
reduce does better when the sample is bigger.
If the greatest part of your program is responsible for summing a reasonably big number of integers, go for forEach.
If you think I'm being a little bit sarcastic here ... you are right. It's very unlikely that this level of performance optimization matters,
especially in the world of REST APIs, microservices and event driven architectures. Even if we use
forEach to sum 100 000 integers obtained via some REST endpoint, we don't profit much.
We gain tenths of microseconds compared to the seconds required to fetch the data from a database, serialize it, transfer it over the wire and deserialize it.
Sum of calculations
Let's take a more "realistic" task this time: perform some math operation on a list of randomly generated floating-point numbers and sum all the calculated values.
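The task can be sketched like this. The particular math operation (sin plus square root) is my guess at "some math operation", not the article's exact formula, and the class name is illustrative.

```java
import java.util.List;

// A sketch of the benchmarked task: apply a non-trivial math function to
// every element and sum the results. The function is illustrative.
public class CalcSums {

    static double calculate(double x) {
        return Math.sin(x) + Math.sqrt(Math.abs(x));
    }

    public static double sequentialSum(List<Double> values) {
        return values.stream().mapToDouble(CalcSums::calculate).sum();
    }

    public static double parallelSum(List<Double> values) {
        return values.parallelStream().mapToDouble(CalcSums::calculate).sum();
    }
}
```

Because the per-element work is much heavier than a plain integer addition, the parallel version has a real chance to pay for its coordination overhead.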
Well, what a surprise!
The forEach score is more or less the same as that of the sequential stream based versions, while the parallel stream
implementations are about 3 - 5 times faster. Actually, such results are quite easy to explain.
There are costs related to
Stream creation and thread coordination that can't be amortized when the number of operations to
parallelise is small or the operation itself is fast.
We didn't observe such benefits from parallel streams in the previous benchmark. According to Amdahl's law, which defines the theoretical limits of parallelism, the greater the part of a program that runs sequentially, the smaller the speedup parallelism can give. This is exactly the case when summing a list of integers: adding two integers is a very fast operation, so the inherently sequential part responsible for thread synchronization prevailed, especially for relatively small numbers of items in the list.
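In its standard formulation (not the article's notation), with p the parallelizable fraction of the work and n the number of workers, Amdahl's law bounds the achievable speedup:

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, if only half the work can run in parallel (p = 0.5), no number of threads can make the program more than twice as fast.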
The same law explains the slightly better performance of the
reduce operation in this benchmark. Parallel reduction
requires a binary associative operation, free from side effects and working on immutable values; summing numbers falls
into this category. If we obey these constraints, no synchronization is required while merging the partial results.
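The associativity constraint can be demonstrated with a small, self-contained example (my own, not from the article): addition is associative and therefore safe under parallel reduction, while subtraction is not and may silently produce a different result than the sequential run.

```java
import java.util.List;

// Demonstrates why the accumulator must be associative.
public class Associativity {

    // Addition is associative: parallel and sequential reductions agree.
    public static int parallelAdd(List<Integer> ns) {
        return ns.parallelStream().reduce(0, Integer::sum);
    }

    // Subtraction is NOT associative: (a - b) - c != a - (b - c), so a
    // parallel reduction may return a different (wrong) result than the
    // sequential one. Never do this.
    public static int parallelSubtract(List<Integer> ns) {
        return ns.parallelStream().reduce(0, (a, b) -> a - b);
    }
}
```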
Grouping
Grouping a data set with streams is easy, concise and very descriptive, and therefore really frequently used, at least in the code I
see on a daily basis. In this benchmark, this feature is compared with a version implemented imperatively using a
forEach loop. There are plenty of variations that might affect performance: the stream can be sequential or parallel, ordered or unordered,
and we have several options when choosing a collector. We can use the
Stream.reduce() operator instead of
Stream.collect(), or different data structures as grouping containers. That's the reason for the multiple tested implementation versions.
The benchmarked task is: take a list of randomly generated
Doubles and group them by the result of dividing each value by 100, rounded to an integral key.
The produced result is a map from that key to the list of matching values.
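The two basic variants being compared can be sketched like this. The key function, (long)(value / 100), is my reading of the task description, and the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A sketch of the grouping task; the key function is my interpretation.
public class Grouping {

    // Imperative version: forEach loop with Map.computeIfAbsent.
    public static Map<Long, List<Double>> groupForEach(List<Double> values) {
        Map<Long, List<Double>> result = new HashMap<>();
        for (Double v : values) {
            result.computeIfAbsent((long) (v / 100), k -> new ArrayList<>()).add(v);
        }
        return result;
    }

    // Declarative version: Collectors.groupingBy.
    public static Map<Long, List<Double>> groupStream(List<Double> values) {
        return values.stream().collect(Collectors.groupingBy(v -> (long) (v / 100)));
    }
}
```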
Among the sequential implementations, the
forEach based benchmarks are slightly better than the stream based ones. Collecting to a
LinkedList usually improves results marginally, but not for the collected parallel stream. Actually, there is a severe
performance loss visible when
Collectors.toCollection(LinkedList::new) is used as the downstream collector. The parallel stream
optimization attempt, simply speaking, failed miserably: a single
addition caused the collectParOpt benchmark to be approximately 10 times slower in every case.
Stream.reduce() on a parallel stream got a try here too. To make parallel reduction work properly, we have to use
immutable data structures, and Java's
HashMap is mutable. Immutable
collections from the Vavr library came to the rescue, but with limited success. To be honest, my expectations towards
parallel reduction performance were initially much higher, but I soon realized those expectations were unreasonable. Vavr's map
has effectively constant time get and
put operations, but we always have to
call them both, because there is nothing similar to plain Java's
Map.compute() operation available in its API.
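The difference can be illustrated with plain Java (my own example, not the article's code): compute() performs the lookup and the update in a single call, while an API without it forces a separate get and put per element.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates compute() vs explicit get-then-put per element.
public class ComputeVsGetPut {

    public static Map<String, Integer> countWithCompute(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            // One map operation per word.
            counts.compute(w, (k, v) -> v == null ? 1 : v + 1);
        }
        return counts;
    }

    public static Map<String, Integer> countWithGetPut(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer current = counts.get(w);                  // first call: lookup
            counts.put(w, current == null ? 1 : current + 1); // second call: update
        }
        return counts;
    }
}
```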
I wouldn't like to draw any major conclusions about the reasons behind these results; there is just not enough
data gathered. The question "to stream or not to stream" seems to have the following answer: if we care about microsecond-level
performance, we should use a sequential
forEach or a parallel
Stream.collect(). And we'd better skip
optimization attempts that haven't been tested in real life scenarios.
Filtering and sorting distinct items
Other frequently used features of
Stream are filtering, sorting and distinct item selection; maybe not all together,
but separately for sure. This benchmark takes random decimal numbers, filters them by predefined minimum and maximum values, takes the
distinct values and sorts them in natural order.
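The stream-based pipeline for this task can be sketched as follows; the bounds and class name are illustrative, not the article's code.

```java
import java.util.List;
import java.util.stream.Collectors;

// A sketch of the benchmarked pipeline: range filter, deduplicate, sort.
public class FilterSortDistinct {

    public static List<Double> process(List<Double> values, double min, double max) {
        return values.stream()
                .filter(v -> v >= min && v <= max) // keep values within [min, max]
                .distinct()                        // drop duplicates
                .sorted()                          // natural order
                .collect(Collectors.toList());
    }
}
```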
Stream based code is the winner here, and the common pattern shows up again: parallel streams get better as the list becomes longer.
Words frequency
This test is inspired by a talk I attended at the Geecon 2019 conference, where I saw a performance comparison of old school Java code and Java 8 Stream based code. I don't think the speaker's message was exactly that Java 8 streams are slow, but anyway, the presented code triggered my curiosity.
The task is: read a text file, split it into words, count all the words, count the number of occurrences of every word, and display the 10 most frequent words. Frequency is simply word count / total words count.
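To make the task concrete, here is a minimal self-contained sketch of it (not the original listing from the talk); it works on an in-memory string instead of a file, and the class and method names are mine.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch of the words-frequency task; names are illustrative.
public class WordFrequency {

    // Split on non-word characters and count occurrences per word.
    public static Map<String, Long> countWords(String text) {
        return Stream.of(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Return the `limit` most frequent words, most frequent first.
    public static List<String> topWords(String text, int limit) {
        return countWords(text).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```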
This is the original code, copied from the author's GitHub repository with only minor adjustments.
This is the original old school Java 2 based implementation:
This is the original Java 8 Stream based implementation:
This is an additional class used for displaying the results:
This is the benchmark code:
Below are the benchmark results. This time, the charts are not normalized and show just the raw results.
My first thought: not very impressive ... but wait, this code looks suspicious:
- Is it reasonable to use a parallel stream for IO operations?
- Is it really fair to use regular-expression based string splitting?
- Does it really help to use a parallel stream whenever possible?
- Is it really a good idea to create an intermediate collection for every operation we perform?
Let's try to get rid of the above doubts and apply simple adjustments to improve stream performance.
Step 1 - Using string tokenizer stream and sequential streams
This is the word frequency counter using sequential streams and a different line-to-words splitter:
We use sequential streams because reading a file from disk might not be the best candidate for parallelization. Under test conditions
there is no real IO performed, because the operating system is smart enough to keep frequently read files in memory; but even so, the stream
returned by Files.lines() is of unknown size, which is a strong limitation for parallelism too. Additionally, instead of
splitting the string and flat mapping over a stream created from the resulting array, we flat map over a stream created from a token iterator -
TokenizerStream. This should be more efficient.
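The author's TokenizerStream class is not reproduced in this article, so the following is only a guess at an equivalent: one way to expose a StringTokenizer as a Stream using a spliterator of unknown size.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.StringTokenizer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// A possible equivalent of the article's TokenizerStream (not the
// original code): wraps a StringTokenizer in a Stream.
public class Tokens {

    public static Stream<String> of(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        Iterator<String> iterator = new Iterator<String>() {
            @Override public boolean hasNext() { return tokenizer.hasMoreTokens(); }
            @Override public String next() { return tokenizer.nextToken(); }
        };
        // Like Files.lines(), the size is unknown up front, so we report
        // an unknown-size spliterator.
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }
}
```

Used together with file reading, this would plug in as `Files.lines(path).flatMap(Tokens::of)`.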
The java8step1 benchmark results.
So far so good: we are better than the first stream based approach, but still below the Java 2 based code's score. Let's try to do better.
Step 2 - Parallel streams
Using parallel streams for IO operations is disputable, but we can definitely make the token stream parallel and unordered without much risk. Let's use parallel streams for all operations except reading the lines from the file.
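The idea can be sketched like this (my own illustration, not the article's code). Note that parallel() marks the whole pipeline as parallel; the lines stream is still consumed through a single spliterator, so the file read itself is not split.

```java
import java.util.stream.Stream;

// Sketch of the step-2 idea: apply parallel()/unordered() for the token
// processing while the source remains a single spliterator.
public class ParallelCounting {

    public static long countTokens(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .parallel()   // run the pipeline in parallel
                .unordered()  // relax encounter order to help splitting
                .filter(w -> !w.isEmpty())
                .count();
    }
}
```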
Here are the results.
We are ahead of the previous competitors now, but not by much. Let's try something more.
Step 3 - Less collections in between
Splitting code into well defined functions is a very good practice, but it means more overhead for this kind of task. We can easily
reduce the number of intermediate collections by combining reading from the file and word counting into one stream. If we look at the previous
examples, we can notice that the first list,
List<String> words, is just a holder for all the words from the file. We
immediately transform it with
Stream.collect(), and later we use the same list of all words to
get the total word count. To avoid having to hold the list of all words, we can group the word counters into a custom map
implementation that counts all
put operations. Using this approach, we get the total words counter for free and we use 2
streams instead of 3.
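The article's CountingMap class is not shown here, so the following is only a guess at what such a "counting" map could look like: a HashMap that tracks how many occurrences were merged in, giving the total word count for free.

```java
import java.util.HashMap;

// A guess at the article's CountingMap (the original is not shown):
// a HashMap that also tracks the grand total of counted occurrences.
public class CountingMap<K> extends HashMap<K, Long> {

    private long totalCount = 0;

    // Count one occurrence of the key and remember the grand total.
    public void count(K key) {
        merge(key, 1L, Long::sum);
        totalCount++;
    }

    public long totalCount() {
        return totalCount;
    }
}
```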
Did you notice that we didn't use parallel streams at all? I did that deliberately, to verify which optimizations really matter. Avoiding unnecessary collection creation is enough to gain a significant performance boost.
Step 4 - Single pass parallel
If we profit from limiting the number of created collections and from using parallel streams separately, let's see what the results will be if we combine both improvements.
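A combined single-pass, parallel version could be sketched as below (my own illustration, not the article's code): one parallel, unordered pass that tokenizes and counts without any intermediate word list, merging directly into a shared concurrent map.

```java
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the combined approach: tokenize and count in one parallel pass.
public class SinglePassCount {

    public static ConcurrentMap<String, Long> count(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .parallel()
                .unordered()
                // groupingByConcurrent accumulates into one shared
                // ConcurrentMap instead of merging per-thread maps.
                .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
    }
}
```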
This time I'm satisfied. The optimized stream based code is about 3 times faster than the original Java 8 example and about 2 times faster
than the original Java 2 implementation. I must admit that using parallel streams is cheating, and
CountingMap could be used
to optimize the Java 2 based code too. On the other hand,
java8step1 did quite well. We already know that parallelism
requires some additional resources, and that we benefit more when we give it more time to work on a given task. There is also a very important
lesson from the Words Frequency example:
Don't use parallel streams blindly, always think if there is any potential for parallelism.
It looks like I found nothing revealing: a stream can sometimes be faster and sometimes slower than a standard loop. I will be frank with
you - I'm really glad that Java 8
Stream is good enough in terms of performance, at least in the scope of common
enterprise class applications. I'd really miss its beautiful and descriptive abstraction if I had to drop it due to poor
performance. I did my best to perform legitimate benchmarks, and I'm convinced by their results. If you are not, my dear reader, please
let me know.
There is one more unknown left to be discovered: all parallel streams share the one and only common
ForkJoinPool within a single JVM.
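This shared pool is easy to observe (a small illustration of the point, not a benchmark): its target parallelism is derived from the number of available processors, and every parallel stream in the JVM competes for the same worker threads.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Shows the shared pool that backs all parallel streams in a JVM.
public class CommonPoolDemo {

    public static int commonPoolParallelism() {
        return ForkJoinPool.commonPool().getParallelism();
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: " + commonPoolParallelism());
        // This parallel stream runs on the same common pool as every
        // other parallel stream in this JVM.
        System.out.println("sum: " + IntStream.rangeClosed(1, 100).parallel().sum());
    }
}
```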
The common knowledge is that we should be careful about this, but what are the real numbers ...
... I didn't know, but I did my homework. Please see the benchmark results of parallel stream performance in a simulated multi-user application: Performance of parallel Streams