From time to time I witness, or am involved in, discussions about the performance of Java 8 streams. The common wisdom is that streams trade some performance for brevity and convenience. I couldn't find any comprehensive set of benchmarks showing how fast (or slow) streams really are, so I decided to prepare and run a few tests myself. Below I present the gathered results.
What and how was tested
I measured the performance of several typical uses of
forEach, such as applying a function to a list of
numbers and summing the results, or filtering and grouping such a list - usual programming tasks. The same tasks were implemented with different
Stream operations and the results were compared.
The choice of benchmarking tool was easy: JMH, the battle-tested microbenchmark toolkit maintained under the umbrella of OpenJDK. It's very flexible and easy to use.
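For illustration, a minimal JMH benchmark in the spirit of these tests might look like the sketch below. This is not the article's actual benchmark code: the class, method and parameter names are mine, and it assumes the JMH dependency (`org.openjdk.jmh`) is on the classpath and the benchmark is launched through the usual JMH runner.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.*;

// Illustrative sketch only: names and parameters are mine, not the article's.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SumBenchmark {

    // JMH runs the benchmark once per @Param value, matching the list
    // sizes used in the article.
    @Param({"1000", "10000", "100000", "1000000"})
    private int size;

    private List<Integer> numbers;

    @Setup
    public void setup() {
        numbers = IntStream.range(0, size).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public long forEachSum() {
        long sum = 0;
        for (Integer n : numbers) {
            sum += n;
        }
        return sum;
    }

    @Benchmark
    public long streamSum() {
        return numbers.stream().mapToLong(Integer::longValue).sum();
    }
}
```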
Most of the benchmarks were run on lists of different sizes: 1 000, 10 000, 100 000 and 1 000 000 elements, to verify how a given implementation performs depending on the number of elements it has to process.
The JVM arguments passed in every test were: -Xms2G -Xmx2G -XX:+UseG1GC. The JVM used was Java HotSpot(TM) 64-Bit Server VM, version 1.8.0_201.
Benchmark results are presented on charts as percentages of the best result within each sample size. The charts are normalized: for a given list size, the best score among the different implementations is found and every score is divided by it, so the best result is 100 %. A 50 % score means that benchmark was half as fast as the best one.
Results are also shown in tables, in operations per second, with the best results emphasized.
Sum of integers
This is the first benchmark, comparing the performance of summing the integers from a list.
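The exact benchmark code isn't reproduced here, but the kinds of implementations being compared can be sketched as follows. The method names (forEachSum, collectSum, reduceSum) are mine, not the article's; they merely mirror the benchmark labels used below.

```java
import java.util.List;

// A sketch of the compared implementations; names are illustrative.
public class IntSums {

    // Classic loop: the "forEach" style baseline.
    public static long forEachSum(List<Integer> numbers) {
        long sum = 0;
        for (Integer n : numbers) {
            sum += n;
        }
        return sum;
    }

    // Stream-based summing via a primitive stream.
    public static long collectSum(List<Integer> numbers) {
        return numbers.stream().mapToLong(Integer::longValue).sum();
    }

    // Stream reduction with an explicit identity and accumulator.
    public static long reduceSum(List<Integer> numbers) {
        return numbers.stream().reduce(0, Integer::sum).longValue();
    }

    // Parallel variant of the reduction.
    public static long parallelReduceSum(List<Integer> numbers) {
        return numbers.parallelStream().reduce(0, Integer::sum).longValue();
    }
}
```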
And this is the results table.
The outcome was easy to predict. We have a clear winner:
forEach, but its advantage shrinks as the number of items grows,
especially when compared to the parallel version of
collect. Sequential
collect is better on 1 000
elements, but then the parallel version starts to move ahead.
reduce doesn't perform well at all, being the worst in almost all cases, but
again, the parallel stream with
reduce does better when the sample is bigger.
If the greatest part of your program is responsible for summing a reasonably big number of integers, go for forEach.
If you think I'm being a little bit sarcastic here ... you are right. It's very unlikely that this level of performance optimization matters,
especially in the world of REST APIs, microservices and event driven architectures. Even if we use
forEach to sum 100 000 integers obtained via some REST endpoint, we don't profit much.
We gain tenths of microseconds compared to the seconds required to fetch the data from a database, serialize it, transfer it over the wire and deserialize it.
Sum of calculations
Let's take a more "realistic" task this time: perform some math operation on a list of randomly generated floating-point numbers and sum all the calculated values.
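The task can be sketched like this. The particular math operation (sin plus square root) is my guess at "some math operation", not the article's exact formula, and the class name is illustrative.

```java
import java.util.List;

// A sketch of the benchmarked task: apply a non-trivial math function to
// every element and sum the results. The function is illustrative.
public class CalcSums {

    static double calculate(double x) {
        return Math.sin(x) + Math.sqrt(Math.abs(x));
    }

    public static double sequentialSum(List<Double> values) {
        return values.stream().mapToDouble(CalcSums::calculate).sum();
    }

    public static double parallelSum(List<Double> values) {
        return values.parallelStream().mapToDouble(CalcSums::calculate).sum();
    }
}
```

Because the per-element work is much heavier than a plain integer addition, the parallel version has a real chance to pay for its coordination overhead.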
Well, what a surprise!
The forEach score is more or less the same as that of the sequential stream based versions, while the parallel stream
implementations are about 3 - 5 times faster. Actually, such results are quite easy to explain.
There are costs related to
Stream creation and thread coordination that can't be amortized when the number of operations to
parallelise is small or the operation itself is fast.
We didn't observe such benefits from parallel streams in the previous benchmark. According to Amdahl's law, which defines the theoretical limits of parallelism, the greater the part of a program that runs sequentially, the smaller the speedup parallelism can give. This is exactly the case when summing a list of integers: adding two integers is a very fast operation, so the inherently sequential part responsible for thread synchronization prevailed, especially for relatively small numbers of items in the list.
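In its standard formulation (not the article's notation), with p the parallelizable fraction of the work and n the number of workers, Amdahl's law bounds the achievable speedup:

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, if only half the work can run in parallel (p = 0.5), no number of threads can make the program more than twice as fast.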
The same law explains the slightly better performance of the
reduce operation in this benchmark. Parallel reduction
requires a binary associative operation, free from side effects and working on immutable values; summing numbers falls
into this category. If we obey these constraints, no synchronization is required while merging the partial results.
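The associativity constraint can be demonstrated with a small, self-contained example (my own, not from the article): addition is associative and therefore safe under parallel reduction, while subtraction is not and may silently produce a different result than the sequential run.

```java
import java.util.List;

// Demonstrates why the accumulator must be associative.
public class Associativity {

    // Addition is associative: parallel and sequential reductions agree.
    public static int parallelAdd(List<Integer> ns) {
        return ns.parallelStream().reduce(0, Integer::sum);
    }

    // Subtraction is NOT associative: (a - b) - c != a - (b - c), so a
    // parallel reduction may return a different (wrong) result than the
    // sequential one. Never do this.
    public static int parallelSubtract(List<Integer> ns) {
        return ns.parallelStream().reduce(0, (a, b) -> a - b);
    }
}
```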
Grouping
Grouping a data set with streams is easy, concise and very descriptive, and therefore really frequently used, at least in the code I
see on a daily basis. In this benchmark, this feature is compared with a version implemented imperatively using a
forEach loop. There are plenty of variations that might affect performance: the stream can be sequential or parallel, ordered or unordered,
and we have several options when choosing a collector. We can use the
Stream.reduce() operator instead of
Stream.collect(), or different data structures as grouping containers. That's the reason for the multiple tested implementation versions.
The benchmarked task is: take a list of randomly generated
Doubles and group them by the result of dividing each value by 100, rounded to an integral key.
The produced result is a map from that key to the list of matching values.
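The two basic variants being compared can be sketched like this. The key function, (long)(value / 100), is my reading of the task description, and the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A sketch of the grouping task; the key function is my interpretation.
public class Grouping {

    // Imperative version: forEach loop with Map.computeIfAbsent.
    public static Map<Long, List<Double>> groupForEach(List<Double> values) {
        Map<Long, List<Double>> result = new HashMap<>();
        for (Double v : values) {
            result.computeIfAbsent((long) (v / 100), k -> new ArrayList<>()).add(v);
        }
        return result;
    }

    // Declarative version: Collectors.groupingBy.
    public static Map<Long, List<Double>> groupStream(List<Double> values) {
        return values.stream().collect(Collectors.groupingBy(v -> (long) (v / 100)));
    }
}
```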
Among the sequential implementations, the
forEach based benchmarks are slightly better than the stream based ones. Collecting to a
LinkedList usually improves results marginally, but not for the collected parallel stream. Actually, there is a severe
performance loss visible when
Collectors.toCollection(LinkedList::new) is used as the downstream collector. The parallel stream
optimization attempt, simply speaking, failed miserably: a single
addition caused the collectParOpt benchmark to be approximately 10 times slower in every case.
Stream.reduce() on a parallel stream got a try here too. To make parallel reduction work properly, we have to use
immutable data structures, and Java's
HashMap is mutable. Immutable
collections from the Vavr library came to the rescue, but with limited success. To be honest, my expectations towards
parallel reduction performance were initially much higher, but I soon realized those expectations were unreasonable. Vavr's map
has effectively constant time get and
put operations, but we always have to
call them both, because there is nothing similar to plain Java's
Map.compute() operation available in its API.
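The difference can be illustrated with plain Java (my own example, not the article's code): compute() performs the lookup and the update in a single call, while an API without it forces a separate get and put per element.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates compute() vs explicit get-then-put per element.
public class ComputeVsGetPut {

    public static Map<String, Integer> countWithCompute(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            // One map operation per word.
            counts.compute(w, (k, v) -> v == null ? 1 : v + 1);
        }
        return counts;
    }

    public static Map<String, Integer> countWithGetPut(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer current = counts.get(w);                  // first call: lookup
            counts.put(w, current == null ? 1 : current + 1); // second call: update
        }
        return counts;
    }
}
```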
I wouldn't like to draw any major conclusions about the reasons behind these results; there is just not enough
data gathered. The question "to stream or not to stream" seems to have the following answer: if we care about microsecond-level
performance, we should use a sequential
forEach or a parallel
Stream.collect(). And we'd better skip
optimization attempts that haven't been tested in real life scenarios.
Filtering and sorting distinct items
Other frequently used features of
Stream are filtering, sorting and distinct item selection; maybe not all together,
but separately for sure. This benchmark takes random decimal numbers, filters them by predefined minimum and maximum values, takes the
distinct values and sorts them in natural order.
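The stream-based pipeline for this task can be sketched as follows; the bounds and class name are illustrative, not the article's code.

```java
import java.util.List;
import java.util.stream.Collectors;

// A sketch of the benchmarked pipeline: range filter, deduplicate, sort.
public class FilterSortDistinct {

    public static List<Double> process(List<Double> values, double min, double max) {
        return values.stream()
                .filter(v -> v >= min && v <= max) // keep values within [min, max]
                .distinct()                        // drop duplicates
                .sorted()                          // natural order
                .collect(Collectors.toList());
    }
}
```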
Stream based code is the winner here, and the common pattern shows up again: parallel streams get better as the list becomes longer.
Words frequency
This test is inspired by a talk I attended at the Geecon 2019 conference, where I saw a performance comparison of old school Java code and Java 8 Stream based code. I don't think the speaker's message was exactly that Java 8 streams are slow, but anyway, the presented code triggered my curiosity.
The task is: read a text file, split it into words, count all the words, count the number of occurrences of every word, and display the 10 most frequent words. Frequency is simply word count / total words count.
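To make the task concrete, here is a minimal self-contained sketch of it (not the original listing from the talk); it works on an in-memory string instead of a file, and the class and method names are mine.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch of the words-frequency task; names are illustrative.
public class WordFrequency {

    // Split on non-word characters and count occurrences per word.
    public static Map<String, Long> countWords(String text) {
        return Stream.of(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Return the `limit` most frequent words, most frequent first.
    public static List<String> topWords(String text, int limit) {
        return countWords(text).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```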
This is the original code, copied from the author's GitHub repository with only minor adjustments.
This is the original old school Java 2 based implementation:
This is the original Java 8 Stream based implementation:
This is an additional class used for displaying the results:
This is the benchmark code:
Below are the benchmark results. This time, the charts are not normalized and show just the raw results.
My first thought: not very impressive ... but wait, this code looks suspicious:
- Is it reasonable to use a parallel stream for IO operations?
- Is it really fair to use regular-expression based string splitting?
- Does it really help to use a parallel stream whenever possible?
- Is it really a good idea to create an intermediate collection for every operation we perform?
Let's try to get rid of the above doubts and apply simple adjustments to improve stream performance.
Step 1 - Using string tokenizer stream and sequential streams
This is the word frequency counter using sequential streams and a different line-to-words splitter:
We use sequential streams because reading a file from disk might not be the best candidate for parallelization. Under test conditions
there is no real IO performed, because the operating system is smart enough to keep frequently read files in memory; but even so, the stream
returned by Files.lines() is of unknown size, which is a strong limitation for parallelism too. Additionally, instead of
splitting the string and flat mapping over a stream created from the resulting array, we flat map over a stream created from a token iterator -
TokenizerStream. This should be more efficient.
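The author's TokenizerStream class is not reproduced in this article, so the following is only a guess at an equivalent: one way to expose a StringTokenizer as a Stream using a spliterator of unknown size.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.StringTokenizer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// A possible equivalent of the article's TokenizerStream (not the
// original code): wraps a StringTokenizer in a Stream.
public class Tokens {

    public static Stream<String> of(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        Iterator<String> iterator = new Iterator<String>() {
            @Override public boolean hasNext() { return tokenizer.hasMoreTokens(); }
            @Override public String next() { return tokenizer.nextToken(); }
        };
        // Like Files.lines(), the size is unknown up front, so we report
        // an unknown-size spliterator.
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }
}
```

Used together with file reading, this would plug in as `Files.lines(path).flatMap(Tokens::of)`.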
The java8step1 benchmark results.
So far so good: we are better than the first stream based approach, but still below the Java 2 based code's score. Let's try to do better.
Step 2 - Parallel streams
Using parallel streams for IO operations is disputable, but we can definitely make the token stream parallel and unordered without much risk. Let's use parallel streams for all operations except reading the lines from the file.
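The idea can be sketched like this (my own illustration, not the article's code). Note that parallel() marks the whole pipeline as parallel; the lines stream is still consumed through a single spliterator, so the file read itself is not split.

```java
import java.util.stream.Stream;

// Sketch of the step-2 idea: apply parallel()/unordered() for the token
// processing while the source remains a single spliterator.
public class ParallelCounting {

    public static long countTokens(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .parallel()   // run the pipeline in parallel
                .unordered()  // relax encounter order to help splitting
                .filter(w -> !w.isEmpty())
                .count();
    }
}
```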
Here are the results.
We are ahead of the previous competitors now, but not by much. Let's try something more.
Step 3 - Less collections in between
Splitting code into well defined functions is a very good practice, but it means more overhead for this kind of task. We can easily
reduce the number of intermediate collections by combining reading from the file and word counting into one stream. If we look at the previous
examples, we can notice that the first list,
List<String> words, is just a holder for all the words from the file. We
immediately transform it with
Stream.collect(), and later we use the same list of all words to
get the total word count. To avoid having to hold the list of all words, we can group the word counters into a custom map
implementation that counts all
put operations. Using this approach, we get the total words counter for free and we use 2
streams instead of 3.
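The article's CountingMap class is not shown here, so the following is only a guess at what such a "counting" map could look like: a HashMap that tracks how many occurrences were merged in, giving the total word count for free.

```java
import java.util.HashMap;

// A guess at the article's CountingMap (the original is not shown):
// a HashMap that also tracks the grand total of counted occurrences.
public class CountingMap<K> extends HashMap<K, Long> {

    private long totalCount = 0;

    // Count one occurrence of the key and remember the grand total.
    public void count(K key) {
        merge(key, 1L, Long::sum);
        totalCount++;
    }

    public long totalCount() {
        return totalCount;
    }
}
```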
Did you notice that we didn't use parallel streams at all? I did that deliberately, to verify which optimizations really matter. Avoiding unnecessary collection creation is enough to gain a significant performance boost.
Step 4 - Single pass parallel
If we profit from limiting the number of created collections and from using parallel streams separately, let's see what the results will be if we combine both improvements.
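A combined single-pass, parallel version could be sketched as below (my own illustration, not the article's code): one parallel, unordered pass that tokenizes and counts without any intermediate word list, merging directly into a shared concurrent map.

```java
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the combined approach: tokenize and count in one parallel pass.
public class SinglePassCount {

    public static ConcurrentMap<String, Long> count(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .parallel()
                .unordered()
                // groupingByConcurrent accumulates into one shared
                // ConcurrentMap instead of merging per-thread maps.
                .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
    }
}
```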
This time I'm satisfied. The optimized stream based code is about 3 times faster than the original Java 8 example and about 2 times faster
than the original Java 2 implementation. I must admit that using parallel streams is cheating, and
CountingMap could be used
to optimize the Java 2 based code too. On the other hand,
java8step1 did quite well. We already know that parallelism
requires some additional resources, and that we benefit more when we give it more time to work on a given task. There is also a very important
lesson from the Words Frequency example:
Don't use parallel streams blindly, always think if there is any potential for parallelism.
It looks like I found nothing revealing: a stream can sometimes be faster and sometimes slower than a standard loop. I will be frank with
you - I'm really glad that Java 8
Stream is good enough in terms of performance, at least in the scope of common
enterprise class applications. I'd really miss its beautiful and descriptive abstraction if I had to drop it due to poor
performance. I did my best to perform legitimate benchmarks, and I'm convinced by their results. If you are not, my dear reader, please
let me know.
There is one more unknown left to be discovered: all parallel streams share the one and only common
ForkJoinPool within a single JVM.
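This shared pool is easy to observe (a small illustration of the point, not a benchmark): its target parallelism is derived from the number of available processors, and every parallel stream in the JVM competes for the same worker threads.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Shows the shared pool that backs all parallel streams in a JVM.
public class CommonPoolDemo {

    public static int commonPoolParallelism() {
        return ForkJoinPool.commonPool().getParallelism();
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: " + commonPoolParallelism());
        // This parallel stream runs on the same common pool as every
        // other parallel stream in this JVM.
        System.out.println("sum: " + IntStream.rangeClosed(1, 100).parallel().sum());
    }
}
```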
The common knowledge is that we should be careful about this, but what are the real numbers ...
... I didn't know, but I did my homework. Please see the benchmark results of parallel stream performance in a simulated multi-user application: Performance of parallel Streams