Friday, June 14, 2013

Java GC in Numbers – Parallel Young Collection

This is the first article in a series in which I would like to study the effect of various HotSpot JVM options on the duration of Stop-the-World (STW) pauses associated with garbage collection.

This article studies how the number of parallel threads affects the duration of the young collection Stop-the-World pause. HotSpot JVM has several young GC algorithms. My experiments cover the following combinations:

  • Serial young (DefNew), Mark Sweep Compact old
  • Parallel young (ParNew), Mark Sweep Compact old
  • Serial young (DefNew), Concurrent Mark Sweep old
  • Parallel young (ParNew), Concurrent Mark Sweep old
  • There is also the PSNew (Parallel Scavenge) algorithm, which is similar to ParNew, but it cannot be used together with Concurrent Mark Sweep (CMS), so I have ignored it.
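To confirm which of these combinations a JVM is actually running, the collector names can be read from the standard `java.lang.management` API. The sketch below is only an illustration and is not part of the benchmark; the class name `ShowCollectors` is made up. On JDK 6/7 the combinations above are, to my knowledge, usually selected with `-XX:+UseSerialGC`, `-XX:+UseParNewGC`, `-XX:+UseConcMarkSweepGC -XX:-UseParNewGC` and `-XX:+UseConcMarkSweepGC -XX:+UseParNewGC`, with `-XX:ParallelGCThreads=N` setting the number of parallel threads. Treat this flag list as an assumption and check it against your JVM version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ShowCollectors {

    /** Returns the names of the collectors reported by the running JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // The exact names depend on the JVM version and the selected collectors.
        for (String name : collectorNames()) {
            System.out.println(name);
        }
    }
}
```

Running it with each set of flags shows which young and old collectors the JVM has picked.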

    In the experiments, I used a synthetic benchmark producing an evenly distributed load on the memory subsystem. The size of the young generation was the same for all experiments (64 MiB). Two versions of HotSpot JVM were used: JDK 6u43 (VM 20.14-b01) and JDK 7u15 (VM 23.7-b01).

    The test box was equipped with two CPUs, each with 12 cores and 2 hardware threads per core (48 hardware threads in total).
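    As a rough illustration of how GC activity can be observed from inside a Java process, the sketch below allocates short-lived arrays and reads the accumulated collection count and time from the GC MXBeans before and after. It is not the benchmark used here (that one is linked at the end of the article). The class `GcProbe`, its methods and the allocation sizes are hypothetical, and garbage that dies immediately is a much simpler load than the one used in the experiments.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {

    // Written on every allocation so the JIT cannot eliminate the allocations.
    static volatile Object sink;

    /** Returns {total collection count, total collection time in ms} over all collectors. */
    static long[] gcTotals() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] { count, timeMs };
    }

    /** Allocates garbage and returns {collections, collection time in ms} observed meanwhile. */
    static long[] churn(int iterations, int chunkBytes) {
        long[] before = gcTotals();
        for (int i = 0; i < iterations; i++) {
            sink = new byte[chunkBytes];
        }
        long[] after = gcTotals();
        return new long[] { after[0] - before[0], after[1] - before[1] };
    }

    public static void main(String[] args) {
        long[] delta = churn(2000000, 1024); // about 2 GB of short-lived garbage
        System.out.println("collections: " + delta[0] + ", total time ms: " + delta[1]);
        if (delta[0] > 0) {
            System.out.println("average ms per collection: " + (double) delta[1] / delta[0]);
        }
    }
}
```

    With a fixed young generation size (for example `-Xmn64m`), the average time per collection gives a coarse estimate of the young pause. GC logs give more precise per-pause numbers.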

    Mark Sweep Compact

    Mark Sweep Compact is prone to regular full GCs, so it is not a good choice for pause-sensitive applications. But it shares the same young collection algorithms and code with the concurrent collector and produces less noisy results, so I included it to better understand the concurrent case.

    The difference between the single-thread case and the 48-thread case is significant, so the numbers are presented in two graphs.

    It is noteworthy (though not surprising) that the serial algorithm performs slightly better than the parallel one running with a single thread. The discrepancy between Java 6 and Java 7 is also interesting, but I have no explanation for it at the moment.

    From the graphs above you can see that more threads are better, but it is not obvious how much better. The graphs below show effective parallelization (the 8-thread case is taken as the base value, because smaller numbers of threads produce fairly noisy results).

    You can see almost linear parallelization up to 16 threads. It is also worth noting that 48 threads are considerably faster than 24, even though there are only 24 physical cores. The effect of parallelization is slightly better for larger heap sizes.
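    One way to read a graph like this is sketched below. It assumes that "effective parallelization" means the pause at the 8-thread baseline divided by the pause at N threads, and that efficiency is that speedup scaled by 8/N, so perfectly linear scaling gives 1.0. The class `ParallelScaling`, its methods and the pause values in `main` are placeholders for illustration, not measurements from these experiments.

```java
public class ParallelScaling {

    /** Speedup of a run relative to the baseline run (baseline pause / pause). */
    static double speedup(double basePauseMs, double pauseMs) {
        return basePauseMs / pauseMs;
    }

    /** Parallel efficiency relative to the baseline: 1.0 means perfectly linear scaling. */
    static double efficiency(int baseThreads, double basePauseMs, int threads, double pauseMs) {
        return speedup(basePauseMs, pauseMs) * baseThreads / threads;
    }

    public static void main(String[] args) {
        // Placeholder numbers only, NOT results from the article.
        int[] threads = { 8, 16, 24, 48 };
        double[] pauseMs = { 40.0, 21.0, 16.0, 12.0 };

        for (int i = 0; i < threads.length; i++) {
            double s = speedup(pauseMs[0], pauseMs[i]);
            double e = efficiency(threads[0], pauseMs[0], threads[i], pauseMs[i]);
            System.out.println(threads[i] + " threads: speedup " + s + ", efficiency " + e);
        }
    }
}
```

    Using a baseline other than one thread keeps the noisy low-thread measurements out of the ratio, at the cost that the result only describes scaling beyond 8 threads.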

    Concurrent Mark Sweep

    Concurrent Mark Sweep is the collector used for pause-sensitive applications, and young collection pause time is something you probably really care about if you have consciously chosen CMS. The same hardware and the same benchmark were used.
    Results are below.

    Compared to Mark Sweep Compact, the concurrent algorithm produces much noisier results (especially for small numbers of threads).

    Java 7 systematically shows worse performance than Java 6, though not by much.

    The parallelization diagrams show the same picture: linear scalability, which degrades with a greater number of threads (experiment conditions are slightly different for the CMS and MSC cases, so a direct comparison of these diagrams is not valid).

    Conclusions

    Tests have confirmed that the parallel young collection algorithms in HotSpot JVM scale extremely well with the number of CPU cores. Having a lot of CPU cores on a server will help you greatly with JVM Stop-the-World pauses.

    Source code

    The source code used for benchmarking and its description are available on GitHub.
    github.com/aragozin/jvm-tools/tree/master/ygc-bench