From time to time I witness, or am involved in, discussions about the performance of Java 8 streams. The common wisdom is that streams trade some performance for brevity and convenience. I couldn't find any comprehensive set of benchmarks showing how fast (or slow) streams really are, so I decided to prepare and run a few tests myself. Below I present the gathered results.
What and how was tested
I measured the performance of several typical uses of
forEach, such as applying a function to a list of
numbers and summing the results, or filtering and grouping such a list - usual programming tasks. The same tasks were implemented with different
Stream operations and the results were compared.
The choice of benchmarking tool was easy: JMH, the battle-tested microbenchmark toolkit maintained under the umbrella of OpenJDK. It's very flexible and easy to use.
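For illustration, a minimal JMH benchmark in the spirit of these tests might look like the sketch below. This is not the article's actual benchmark code: the class, method and parameter names are mine, and it assumes the JMH dependency (`org.openjdk.jmh`) is on the classpath and the benchmark is launched through the usual JMH runner.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.*;

// Illustrative sketch only: names and parameters are mine, not the article's.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SumBenchmark {

    // JMH runs the benchmark once per @Param value, matching the list
    // sizes used in the article.
    @Param({"1000", "10000", "100000", "1000000"})
    private int size;

    private List<Integer> numbers;

    @Setup
    public void setup() {
        numbers = IntStream.range(0, size).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public long forEachSum() {
        long sum = 0;
        for (Integer n : numbers) {
            sum += n;
        }
        return sum;
    }

    @Benchmark
    public long streamSum() {
        return numbers.stream().mapToLong(Integer::longValue).sum();
    }
}
```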
Most of the benchmarks were run on lists of different sizes: 1 000, 10 000, 100 000 and 1 000 000 elements, to verify how a given implementation performs depending on the number of elements it has to process.
The JVM arguments passed in every test were: -Xms2G -Xmx2G -XX:+UseG1GC. The JVM used was Java HotSpot(TM) 64-Bit Server VM, version 1.8.0_201.
Benchmark results are presented on charts as percentages of the best result within each sample size. The charts are normalized: for a given list size, the best score among the different implementations is found and every score is divided by it, so the best result is 100 %. A 50 % score means that benchmark was half as fast as the best one.
Results are also shown in tables, in operations per second, with the best results emphasized.
Sum of integers
This is the first benchmark, comparing the performance of summing the integers from a list.
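The exact benchmark code isn't reproduced here, but the kinds of implementations being compared can be sketched as follows. The method names (forEachSum, collectSum, reduceSum) are mine, not the article's; they merely mirror the benchmark labels used below.

```java
import java.util.List;

// A sketch of the compared implementations; names are illustrative.
public class IntSums {

    // Classic loop: the "forEach" style baseline.
    public static long forEachSum(List<Integer> numbers) {
        long sum = 0;
        for (Integer n : numbers) {
            sum += n;
        }
        return sum;
    }

    // Stream-based summing via a primitive stream.
    public static long collectSum(List<Integer> numbers) {
        return numbers.stream().mapToLong(Integer::longValue).sum();
    }

    // Stream reduction with an explicit identity and accumulator.
    public static long reduceSum(List<Integer> numbers) {
        return numbers.stream().reduce(0, Integer::sum).longValue();
    }

    // Parallel variant of the reduction.
    public static long parallelReduceSum(List<Integer> numbers) {
        return numbers.parallelStream().reduce(0, Integer::sum).longValue();
    }
}
```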
And this is the results table.
The outcome was easy to predict. We have a clear winner:
forEach, but its advantage shrinks as the number of items grows,
especially when compared to the parallel version of
collect. Sequential
collect is better on 1 000
elements, but then the parallel version starts to move ahead.
reduce doesn't perform well at all, being the worst in almost all cases, but
again, the parallel stream with
reduce does better when the sample is bigger.
If the greatest part of your program is responsible for summing a reasonably big number of integers, go for forEach.
If you think I'm being a little bit sarcastic here ... you are right. It's very unlikely that this level of performance optimization matters,
especially in the world of REST APIs, microservices and event driven architectures. Even if we use
forEach to sum 100 000 integers obtained via some REST endpoint, we don't profit much.
We gain tenths of microseconds compared to the seconds required to fetch the data from a database, serialize it, transfer it over the wire and deserialize it.
Sum of calculations
Let's take a more "realistic" task this time: perform some math operation on a list of randomly generated floating-point numbers and sum all the calculated values.
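The task can be sketched like this. The particular math operation (sin plus square root) is my guess at "some math operation", not the article's exact formula, and the class name is illustrative.

```java
import java.util.List;

// A sketch of the benchmarked task: apply a non-trivial math function to
// every element and sum the results. The function is illustrative.
public class CalcSums {

    static double calculate(double x) {
        return Math.sin(x) + Math.sqrt(Math.abs(x));
    }

    public static double sequentialSum(List<Double> values) {
        return values.stream().mapToDouble(CalcSums::calculate).sum();
    }

    public static double parallelSum(List<Double> values) {
        return values.parallelStream().mapToDouble(CalcSums::calculate).sum();
    }
}
```

Because the per-element work is much heavier than a plain integer addition, the parallel version has a real chance to pay for its coordination overhead.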
Well, what a surprise!
The forEach score is more or less the same as that of the sequential stream based versions, while the parallel stream
implementations are about 3 - 5 times faster. Actually, such results are quite easy to explain.
There are costs related to
Stream creation and thread coordination that can't be amortized when the number of operations to
parallelise is small or the operation itself is fast.
We didn't observe such benefits from parallel streams in the previous benchmark. According to Amdahl's law, which defines the theoretical limits of parallelism, the greater the part of a program that runs sequentially, the smaller the speedup parallelism can give. This is exactly the case when summing a list of integers: adding two integers is a very fast operation, so the inherently sequential part responsible for thread synchronization prevailed, especially for relatively small numbers of items in the list.
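In its standard formulation (not the article's notation), with p the parallelizable fraction of the work and n the number of workers, Amdahl's law bounds the achievable speedup:

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, if only half the work can run in parallel (p = 0.5), no number of threads can make the program more than twice as fast.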
The same law explains the slightly better performance of the
reduce operation in this benchmark. Parallel reduction
requires a binary associative operation, free from side effects and working on immutable values; summing numbers falls
into this category. If we obey these constraints, no synchronization is required while merging the partial results.
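The associativity constraint can be demonstrated with a small, self-contained example (my own, not from the article): addition is associative and therefore safe under parallel reduction, while subtraction is not and may silently produce a different result than the sequential run.

```java
import java.util.List;

// Demonstrates why the accumulator must be associative.
public class Associativity {

    // Addition is associative: parallel and sequential reductions agree.
    public static int parallelAdd(List<Integer> ns) {
        return ns.parallelStream().reduce(0, Integer::sum);
    }

    // Subtraction is NOT associative: (a - b) - c != a - (b - c), so a
    // parallel reduction may return a different (wrong) result than the
    // sequential one. Never do this.
    public static int parallelSubtract(List<Integer> ns) {
        return ns.parallelStream().reduce(0, (a, b) -> a - b);
    }
}
```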
Grouping
Grouping a data set with streams is easy, concise and very descriptive, and therefore really frequently used, at least in the code I
see on a daily basis. In this benchmark, this feature is compared with a version implemented imperatively using a
forEach loop. There are plenty of variations that might affect performance: the stream can be sequential or parallel, ordered or unordered,
and we have several options when choosing a collector. We can use the
Stream.reduce() operator instead of
Stream.collect(), or different data structures as grouping containers. That's the reason for the multiple tested implementation versions.
The benchmarked task is: take a list of randomly generated
Doubles and group them by the result of dividing each value by 100, rounded to an integral key.
The produced result is a map from that key to the list of matching values.
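The two basic variants being compared can be sketched like this. The key function, (long)(value / 100), is my reading of the task description, and the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A sketch of the grouping task; the key function is my interpretation.
public class Grouping {

    // Imperative version: forEach loop with Map.computeIfAbsent.
    public static Map<Long, List<Double>> groupForEach(List<Double> values) {
        Map<Long, List<Double>> result = new HashMap<>();
        for (Double v : values) {
            result.computeIfAbsent((long) (v / 100), k -> new ArrayList<>()).add(v);
        }
        return result;
    }

    // Declarative version: Collectors.groupingBy.
    public static Map<Long, List<Double>> groupStream(List<Double> values) {
        return values.stream().collect(Collectors.groupingBy(v -> (long) (v / 100)));
    }
}
```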
Among the sequential implementations, the
forEach based benchmarks are slightly better than the stream based ones. Collecting to a
LinkedList usually improves results marginally, but not for the collected parallel stream. Actually, there is a severe
performance loss visible when
Collectors.toCollection(LinkedList::new) is used as the downstream collector. The parallel stream
optimization attempt, simply speaking, failed miserably: a single
addition caused the collectParOpt benchmark to be approximately 10 times slower in every case.
Stream.reduce() on a parallel stream got a try here too. To make parallel reduction work properly, we have to use
immutable data structures, and Java's
HashMap is mutable. Immutable
collections from the Vavr library came to the rescue, but with limited success. To be honest, my expectations towards
parallel reduction performance were initially much higher, but I soon realized those expectations were unreasonable. Vavr's map
has effectively constant time get and
put operations, but we always have to
call them both, because there is nothing similar to plain Java's
Map.compute() operation available in its API.
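The difference can be illustrated with plain Java (my own example, not the article's code): compute() performs the lookup and the update in a single call, while an API without it forces a separate get and put per element.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates compute() vs explicit get-then-put per element.
public class ComputeVsGetPut {

    public static Map<String, Integer> countWithCompute(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            // One map operation per word.
            counts.compute(w, (k, v) -> v == null ? 1 : v + 1);
        }
        return counts;
    }

    public static Map<String, Integer> countWithGetPut(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer current = counts.get(w);                  // first call: lookup
            counts.put(w, current == null ? 1 : current + 1); // second call: update
        }
        return counts;
    }
}
```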
I wouldn't like to draw any major conclusions about the reasons behind these results; there is just not enough
data gathered. The question "to stream or not to stream" seems to have the following answer: if we care about microsecond-level
performance, we should use a sequential
forEach or a parallel
Stream.collect(). And we'd better skip
optimization attempts that haven't been tested in real life scenarios.
Filtering and sorting distinct items
Other frequently used features of
Stream are filtering, sorting and distinct item selection; maybe not all together,
but separately for sure. This benchmark takes random decimal numbers, filters them by predefined minimum and maximum values, takes the
distinct values and sorts them in natural order.
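The stream-based pipeline for this task can be sketched as follows; the bounds and class name are illustrative, not the article's code.

```java
import java.util.List;
import java.util.stream.Collectors;

// A sketch of the benchmarked pipeline: range filter, deduplicate, sort.
public class FilterSortDistinct {

    public static List<Double> process(List<Double> values, double min, double max) {
        return values.stream()
                .filter(v -> v >= min && v <= max) // keep values within [min, max]
                .distinct()                        // drop duplicates
                .sorted()                          // natural order
                .collect(Collectors.toList());
    }
}
```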
Stream based code is the winner here, and the common pattern shows up again: parallel streams get better as the list becomes longer.
Words frequency
This test is inspired by a talk I attended at the Geecon 2019 conference, where I saw a performance comparison of old school Java code and Java 8 Stream based code. I don't think the speaker's message was exactly that Java 8 streams are slow, but anyway, the presented code triggered my curiosity.
The task is: read a text file, split it into words, count all the words, count the number of occurrences of every word, and display the 10 most frequent words. Frequency is simply word count / total words count.
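To make the task concrete, here is a minimal self-contained sketch of it (not the original listing from the talk); it works on an in-memory string instead of a file, and the class and method names are mine.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch of the words-frequency task; names are illustrative.
public class WordFrequency {

    // Split on non-word characters and count occurrences per word.
    public static Map<String, Long> countWords(String text) {
        return Stream.of(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Return the `limit` most frequent words, most frequent first.
    public static List<String> topWords(String text, int limit) {
        return countWords(text).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```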
This is the original code, copied from the author's GitHub repository with only minor adjustments.
This is the original old school Java 2 based implementation:
This is the original Java 8 Stream based implementation:
This is an additional class used for displaying the results:
This is the benchmark code:
Below are the benchmark results. This time, the charts are not normalized and show just the raw results.
My first thought: not very impressive ... but wait, this code looks suspicious:
- Is it reasonable to use a parallel stream for IO operations?
- Is it really fair to use regular-expression based string splitting?
- Does it really help to use a parallel stream whenever possible?
- Is it really a good idea to create an intermediate collection for every operation we perform?
Let's try to get rid of the above doubts and apply simple adjustments to improve stream performance.
Step 1 - Using string tokenizer stream and sequential streams
This is the word frequency counter using sequential streams and a different line-to-words splitter:
We use sequential streams because reading a file from disk might not be the best candidate for parallelization. Under test conditions
there is no real IO performed, because the operating system is smart enough to keep frequently read files in memory; but even so, the stream
returned by Files.lines() is of unknown size, which is a strong limitation for parallelism too. Additionally, instead of
splitting the string and flat mapping over a stream created from the resulting array, we flat map over a stream created from a token iterator -
TokenizerStream. This should be more efficient.
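The author's TokenizerStream class is not reproduced in this article, so the following is only a guess at an equivalent: one way to expose a StringTokenizer as a Stream using a spliterator of unknown size.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.StringTokenizer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// A possible equivalent of the article's TokenizerStream (not the
// original code): wraps a StringTokenizer in a Stream.
public class Tokens {

    public static Stream<String> of(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        Iterator<String> iterator = new Iterator<String>() {
            @Override public boolean hasNext() { return tokenizer.hasMoreTokens(); }
            @Override public String next() { return tokenizer.nextToken(); }
        };
        // Like Files.lines(), the size is unknown up front, so we report
        // an unknown-size spliterator.
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }
}
```

Used together with file reading, this would plug in as `Files.lines(path).flatMap(Tokens::of)`.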
The java8step1 benchmark results.
So far so good: we are better than the first stream based approach, but still below the Java 2 based code's score. Let's try to do better.
Step 2 - Parallel streams
Using parallel streams for IO operations is disputable, but we can definitely make the token stream parallel and unordered without much risk. Let's use parallel streams for all operations except reading the lines from the file.
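The idea can be sketched like this (my own illustration, not the article's code). Note that parallel() marks the whole pipeline as parallel; the lines stream is still consumed through a single spliterator, so the file read itself is not split.

```java
import java.util.stream.Stream;

// Sketch of the step-2 idea: apply parallel()/unordered() for the token
// processing while the source remains a single spliterator.
public class ParallelCounting {

    public static long countTokens(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .parallel()   // run the pipeline in parallel
                .unordered()  // relax encounter order to help splitting
                .filter(w -> !w.isEmpty())
                .count();
    }
}
```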
Here are the results.
We are ahead of the previous competitors now, but not by much. Let's try something more.
Step 3 - Less collections in between
Splitting code into well defined functions is a very good practice, but it means more overhead for this kind of task. We can easily
reduce the number of intermediate collections by combining reading from the file and word counting into one stream. If we look at the previous
examples, we can notice that the first list,
List<String> words, is just a holder for all the words from the file. We
immediately transform it with
Stream.collect(), and later we use the same list of all words to
get the total word count. To avoid having to hold the list of all words, we can group the word counters into a custom map
implementation that counts all
put operations. Using this approach, we get the total words counter for free and we use 2
streams instead of 3.
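The article's CountingMap class is not shown here, so the following is only a guess at what such a "counting" map could look like: a HashMap that tracks how many occurrences were merged in, giving the total word count for free.

```java
import java.util.HashMap;

// A guess at the article's CountingMap (the original is not shown):
// a HashMap that also tracks the grand total of counted occurrences.
public class CountingMap<K> extends HashMap<K, Long> {

    private long totalCount = 0;

    // Count one occurrence of the key and remember the grand total.
    public void count(K key) {
        merge(key, 1L, Long::sum);
        totalCount++;
    }

    public long totalCount() {
        return totalCount;
    }
}
```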
Did you notice that we didn't use parallel streams at all? I did that deliberately, to verify which optimizations really matter. Avoiding unnecessary collection creation is enough to gain a significant performance boost.
Step 4 - Single pass parallel
If we profit from limiting the number of created collections and from using parallel streams separately, let's see what the results will be if we combine both improvements.
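A combined single-pass, parallel version could be sketched as below (my own illustration, not the article's code): one parallel, unordered pass that tokenizes and counts without any intermediate word list, merging directly into a shared concurrent map.

```java
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the combined approach: tokenize and count in one parallel pass.
public class SinglePassCount {

    public static ConcurrentMap<String, Long> count(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .parallel()
                .unordered()
                // groupingByConcurrent accumulates into one shared
                // ConcurrentMap instead of merging per-thread maps.
                .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
    }
}
```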
This time I'm satisfied. The optimized stream based code is about 3 times faster than the original Java 8 example and about 2 times faster
than the original Java 2 implementation. I must admit that using parallel streams is cheating, and
CountingMap could be used
to optimize the Java 2 based code too. On the other hand,
java8step1 did quite well. We already know that parallelism
requires some additional resources, and that we benefit more when we give it more time to work on a given task. There is also a very important
lesson from the Words Frequency example:
Don't use parallel streams blindly, always think if there is any potential for parallelism.
It looks like I found nothing revealing: a stream can sometimes be faster and sometimes slower than a standard loop. I will be frank with
you - I'm really glad that Java 8
Stream is good enough in terms of performance, at least in the scope of common
enterprise class applications. I'd really miss its beautiful and descriptive abstraction if I had to drop it due to poor
performance. I did my best to perform legitimate benchmarks, and I'm convinced by their results. If you are not, my dear reader, please
let me know.
There is one more unknown left to be discovered: all parallel streams share the one and only common
ForkJoinPool within a single JVM.
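This shared pool is easy to observe (a small illustration of the point, not a benchmark): its target parallelism is derived from the number of available processors, and every parallel stream in the JVM competes for the same worker threads.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Shows the shared pool that backs all parallel streams in a JVM.
public class CommonPoolDemo {

    public static int commonPoolParallelism() {
        return ForkJoinPool.commonPool().getParallelism();
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: " + commonPoolParallelism());
        // This parallel stream runs on the same common pool as every
        // other parallel stream in this JVM.
        System.out.println("sum: " + IntStream.rangeClosed(1, 100).parallel().sum());
    }
}
```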
The common knowledge is that we should be careful about this, but what are the real numbers ...
... I didn't know, but I did my homework. Please see the benchmark results of parallel stream performance in a simulated multi-user application: Performance of parallel Streams