This is the first article in a series in which I would like to study the effect of various HotSpot JVM options on the duration of STW pauses associated with garbage collection.
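This article studies how the number of parallel threads affects the duration of the young collection Stop-the-World pause. HotSpot JVM has several young GC algorithms. My experiments cover the serial and parallel (ParNew) young collectors in combination with the Mark Sweep Compact and Concurrent Mark Sweep old space collectors.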
There is also the PSNew (Parallel Scavenge) algorithm, similar to ParNew, but it cannot be used together with Concurrent Mark Sweep (CMS), so I have ignored it.
In the experiments, I used a synthetic benchmark producing an evenly distributed load on the memory subsystem. The size of the young generation was the same for all experiments (64 MiB). Two versions of HotSpot JVM were used: JDK 6u43 (VM 20.14-b01) and JDK 7u15 (VM 23.7-b01).
The test box was equipped with two CPUs, each with 12 cores x 2 hardware threads (48 hardware threads in total).
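To make the setup more concrete, here is a minimal sketch of how the collector pairings and thread counts could be expressed as HotSpot command-line options. The class and method names are hypothetical, and the flag choices (including `-Xmn64m` for the young generation size) are my assumption about one way to configure a JDK 6/7 era JVM, not necessarily the exact options used by the benchmark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the benchmark: prints candidate JVM command lines.
class GcFlagMatrix {

    /**
     * Builds HotSpot options for one collector pairing.
     * parallelYoung: ParNew instead of the serial young collector.
     * cmsOld: Concurrent Mark Sweep instead of Mark Sweep Compact in old space.
     */
    static List<String> jvmOptions(boolean parallelYoung, boolean cmsOld, int gcThreads) {
        List<String> opts = new ArrayList<>();
        opts.add("-Xmn64m"); // assumption: one way to fix the young generation at 64 MiB
        if (cmsOld) {
            opts.add("-XX:+UseConcMarkSweepGC");
            // state the young collector explicitly instead of relying on the default
            opts.add(parallelYoung ? "-XX:+UseParNewGC" : "-XX:-UseParNewGC");
        } else {
            // without the CMS flag, old space is handled by Mark Sweep Compact
            opts.add(parallelYoung ? "-XX:+UseParNewGC" : "-XX:+UseSerialGC");
        }
        if (parallelYoung) {
            opts.add("-XX:ParallelGCThreads=" + gcThreads);
        }
        return opts;
    }

    public static void main(String[] args) {
        int[] threadCounts = {1, 8, 16, 24, 48}; // thread counts mentioned in the article
        for (boolean cmsOld : new boolean[] {false, true}) {
            // the serial young collector has no thread count to vary
            System.out.println("java " + String.join(" ", jvmOptions(false, cmsOld, 1)) + " <benchmark>");
            for (int n : threadCounts) {
                System.out.println("java " + String.join(" ", jvmOptions(true, cmsOld, n)) + " <benchmark>");
            }
        }
    }
}
```

The `<benchmark>` placeholder stands for whatever main class or jar is being measured. These flags belong to the JDK 6/7 generation of HotSpot; later JDK releases deprecated and then removed some of them.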
Mark Sweep Compact
Mark Sweep Compact is prone to regular full GCs, so it is not a choice for pause-sensitive applications. But it shares the same young collection algorithms/code with the concurrent collector and produces less noisy results, so I added it to better understand the concurrent case.
The difference between the single-thread case and the 48-thread case is significant, so the numbers are presented in two graphs.
It is noteworthy (though not surprising) that the serial algorithm performs slightly better than the parallel one with a single thread. The discrepancy between Java 6 and Java 7 is also interesting, but I have no explanation for it at the moment.
From the graphs above you can get the idea that more threads is better, but it is not obvious exactly how much better. The graphs below show effective parallelization (the 8-thread case is taken as the base value, because smaller numbers of threads produce fairly noisy results).
You can see almost linear parallelization up to 16 threads. It is also worth noting that 48 threads are considerably faster than 24, even though there are only 24 physical cores. The effect of parallelization is slightly better for larger heap sizes.
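For readers who want to reproduce this kind of chart from their own measurements, here is a minimal sketch of one plausible way to compute parallelization relative to a baseline thread count. The definitions (speedup as baseline pause divided by measured pause, efficiency as speedup divided by ideal linear speedup) are my assumption, the class name is hypothetical, and the pause values in `main` are made-up placeholders, not results from these experiments.

```java
// Hypothetical helper for turning measured pause times into relative scaling figures.
class ParallelizationEstimate {

    /** Speedup of a run relative to the baseline run: basePause / pause. */
    static double speedup(double basePauseMs, double pauseMs) {
        return basePauseMs / pauseMs;
    }

    /** Fraction of ideal linear scaling achieved; 1.0 means perfectly linear. */
    static double efficiency(int baseThreads, double basePauseMs, int threads, double pauseMs) {
        double ideal = (double) threads / baseThreads;
        return speedup(basePauseMs, pauseMs) / ideal;
    }

    public static void main(String[] args) {
        // PLACEHOLDER pause times in milliseconds, NOT measurements from the article
        int[] threads = {8, 16, 24, 48};
        double[] pauseMs = {10.0, 5.2, 3.8, 2.6};

        int baseThreads = threads[0];      // 8-thread case as the base value
        double basePause = pauseMs[0];
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%2d threads: speedup %.2f, efficiency %.2f%n",
                    threads[i],
                    speedup(basePause, pauseMs[i]),
                    efficiency(baseThreads, basePause, threads[i], pauseMs[i]));
        }
    }
}
```

With this definition, an efficiency that stays near 1.0 corresponds to the "almost linear" region, and a value that drops as threads are added corresponds to the degradation seen at higher thread counts.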
Concurrent Mark Sweep
Concurrent Mark Sweep is a collector used for pause-sensitive applications, and young collection pause time is something you probably really care about if you have consciously chosen CMS. The same hardware and the same benchmark were used.
Results are below.
Compared to Mark Sweep Compact, the concurrent algorithm produces much noisier results (especially for small numbers of threads).
Java 7 systematically shows worse performance than Java 6, though not by much.
The parallelization diagrams show the same picture: linear scalability, which degrades as the number of threads grows (experiment conditions are slightly different for the CMS and MSC cases, so a direct comparison of these diagrams is not correct).
Conclusions
Tests have confirmed that the parallel young collection algorithms in HotSpot JVM scale extremely well with the number of CPU cores. Having a lot of CPU cores on a server will help you greatly with JVM Stop-the-World pauses.
Source code
The source code used for benchmarking and its description are available on GitHub.
github.com/aragozin/jvm-tools/tree/master/ygc-bench
Alexey, can you give some more detail on the test method you used? E.g. did you fill up the OldGen with dead stuff before timing the newgen collections? Perhaps post some source code?
I have added a link to GitHub where you can find the code and its description.
Great work Alexey, thanks.
Could you explain how the rate of object allocation is calculated from the GC log? Are they collected separately for different pools?
Sorry, I'm missing your question.
Objects are (almost) always allocated in young space. Young space is collected separately. When young space is collected, a few objects can get promoted to old space. Old space may be collected separately (e.g. CMS) or together with young space (e.g. MSC).