Wednesday, July 6, 2011

OpenJDK patch cutting down GC pause duration up to 8 times

The patch described in this article (RFE 7068625) has been included in mainstream OpenJDK.
It is available since Oracle's Java 7 update 40.

I have spent a fair amount of time tuning GC on various JVMs during my career. Recently I was faced with a new challenge - running a JVM with 32GiB of heap space. The first few tests showed that the factors affecting GC pause time on a 32GiB heap are very different from those on, e.g., a 1GiB heap. That was the beginning of this story.

Patch

The patch itself is fairly trivial, unlike the amount of research it took to find this opportunity for improvement. I do not want to bore you, so I'm putting the description of the patch first and the whole story after it.

Effect of the patch

The patch can considerably reduce pause times for the CMS (young collection and remark pauses) and serial (young collection pauses) collectors on large heaps (8GiB and greater). The exact improvement depends on the application, but if your application is a typical server processing a load of requests, you can expect something like a 1.5-4 times cut in your GC pauses with the CMS collector on the x86/AMD64 architecture.

How does it work?

There are many cost factors affecting the duration of a young collection, but here I will focus on three of them (which are usually dominant):
1. effort to scan the dirty card table (proportional to heap size),
2. effort to scan references in dirty cards (application specific, but asymptotically constant as heap size grows),
3. effort to relocate surviving objects in young space (roughly proportional to the size of the surviving objects).
You can find a much more detailed explanation of young collection in my other articles.
These three are the dominant factors for GC pause time. Factors 2 and 3 are application specific, but they do not depend much on total JVM heap size (because they are related to young objects). Both factors 2 and 3 can be reduced by collecting the young space more frequently (by shrinking it). Factor 3 also depends on the tenuring policy (you can read more here).
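(For reference, the young generation size and tenuring policy mentioned above are controlled with standard HotSpot options; the values below are purely illustrative, not a recommendation.)

    -Xmn512m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=4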
Factor 1 - the effort to scan the dirty card table - is the most interesting one. It is application neutral and proportional to the size of the old space (heap size minus young space, plus permanent space). So eventually, as the heap grows, it becomes the dominant factor for GC pause time.
The HotSpot JVM (OpenJDK) uses 512-byte cards for its card-marking write barrier. This means that for 32GiB of heap the garbage collector has to scan roughly 64MiB of card table. How long can it take to scan 64MiB? The memory subsystem of a modern server can stream that many bytes in 1-3ms, yet experiments show the JVM spending dozens of milliseconds scanning the card table. This led me to the suspicion that the JVM code is doing something suboptimal here (also, the competing JRockit JVM can do young collections much faster, which is another reason to look for a problem in the OpenJDK code base).
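For context, the sketch below is a simplified model of a card-marking write barrier with 512-byte cards, not the actual HotSpot barrier: the table base and its biasing are managed by the VM, and the names card_table, heap_base and post_reference_write are made up for illustration. It also shows where the 64MiB figure comes from: 32GiB / 512B per card = 64MiB of card table.

    #include <stdint.h>

    static const int    card_shift = 9;   // log2(512): one card per 512 bytes of heap
    static const int8_t dirty_card = 0;   // HotSpot uses 0 for "dirty" and -1 for "clean"
    extern int8_t*      card_table;       // hypothetical flat card table covering the heap
    extern uintptr_t    heap_base;        // hypothetical heap start address

    // Conceptually, after every reference store the JIT-compiled code marks the card
    // covering the updated field as dirty; young collection later scans the whole
    // table to find these dirty cards.
    inline void post_reference_write(void* field_addr) {
        uintptr_t offset = reinterpret_cast<uintptr_t>(field_addr) - heap_base;
        card_table[offset >> card_shift] = dirty_card;
    }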

Code

After browsing the GC code base in OpenJDK I identified a suspicious place in cardTableRS.cpp. The class ClearNoncleanCardWrapper is used by both the serial and CMS collectors to find dirty memory regions using the card table. Below is the original code of the do_MemRegion method:
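The original listing is not reproduced here; the following is a simplified, self-contained stand-in for the byte-by-byte pattern it used, with the card table modelled as a plain array and the dirty-card closure as a callback. The names (scan_cards_byte_by_byte, dirty_range_closure) are mine, not HotSpot's, and details such as scan direction and CMS-specific card clearing are omitted.

    #include <stdint.h>
    #include <stddef.h>

    typedef int8_t jbyte;
    static const jbyte clean_card = -1;                // HotSpot's "clean" card value

    // Stand-in for _dirty_card_closure->do_MemRegion(): receives each maximal run
    // of non-clean cards as a half-open card index range [first_card, limit_card).
    typedef void (*dirty_range_closure)(size_t first_card, size_t limit_card);

    void scan_cards_byte_by_byte(jbyte* cards, size_t len, dirty_range_closure process) {
        size_t cur = 0;
        while (cur < len) {
            if (cards[cur] != clean_card) {            // start of a dirty run
                size_t start = cur;
                while (cur < len && cards[cur] != clean_card) {
                    cards[cur] = clean_card;           // reset the card while scanning
                    cur++;
                }
                process(start, cur);                   // report the dirty range
            } else {
                cur++;                                 // one load, compare and branch per clean card
            }
        }
    }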
Scanning memory byte by byte raised a red flag for me. Unaligned byte operations are fairly expensive on modern CPUs, and the loop involves a branch for every byte. With a large heap the majority of cards will be clean, so we can expect a serious performance boost from adding an optimized code path that skips contiguous ranges of clean cards. Below is my implementation of the same method.
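Again, this is a simplified stand-in rather than the actual patch. The idea is the extra fast path: read a whole machine word of the card table at once and, while it contains nothing but clean cards, skip a word's worth of cards (8 on 64-bit) per iteration. memcpy is used here to sidestep alignment and aliasing concerns; the real code works on aligned words directly.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    typedef int8_t jbyte;
    static const jbyte    clean_card = -1;
    static const intptr_t clean_word = -1;             // a word with every byte equal to clean_card

    typedef void (*dirty_range_closure)(size_t first_card, size_t limit_card);

    void scan_cards_with_fast_path(jbyte* cards, size_t len, dirty_range_closure process) {
        size_t cur = 0;
        while (cur < len) {
            // Fast path: skip runs of clean cards one machine word at a time.
            intptr_t word;
            while (cur + sizeof(word) <= len) {
                memcpy(&word, cards + cur, sizeof(word));
                if (word != clean_word) break;         // some card in this word is not clean
                cur += sizeof(word);
            }
            if (cur >= len) break;
            // Slow path: same byte-by-byte handling as the original version.
            if (cards[cur] != clean_card) {
                size_t start = cur;
                while (cur < len && cards[cur] != clean_card) {
                    cards[cur] = clean_card;
                    cur++;
                }
                process(start, cur);
            } else {
                cur++;                                 // clean card inside a partially dirty word
            }
        }
    }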
 

Please keep in mind that my version is just a proof of concept. I wrote this code only to prove the importance of such an optimization. It is not an example of good coding and it can be tuned further for performance. In the implementation above, an additional code path is added which skips contiguous ranges of clean cards in a very tight loop, 8 cards per iteration (so on modern architectures only memory speed should limit it).
The first test on the serial collector showed the desired effect: e.g. on a 24GiB heap, scan time was reduced from 64ms to 8.6ms (almost an 8 times improvement).

A bit more work for the CMS collector

The parallel young collector (CMS's ParNew) did not show any positive effect from the patch at first (even though it uses the same code base). It turns out that the ParNew collector divides its work into very fine-grained strides which are distributed between threads. The default stride size is so small that the effect of the fast loop is negligible. Fortunately, the stride size can be configured; adding the following options made the results of the patched JVM much more spectacular:
-XX:+UnlockDiagnosticVMOptions
-XX:ParGCCardsPerStrideChunk=4096
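For reference, a complete launch configuration for such a large-heap CMS test could look roughly like the lines below; the heap size and the collector flags are only illustrative, and just the two diagnostic options above are specific to this experiment.

    java -server -Xms28g -Xmx28g
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
         -XX:+UnlockDiagnosticVMOptions
         -XX:ParGCCardsPerStrideChunk=4096
         <application main class>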

For the final judgment of the patch I used an Oracle Coherence data grid node with 28GiB of heap and about 15M objects in the cache. The JVM was configured to use the CMS collector (with the parallel young space collector). Tests were performed on an Amazon EC2 High-Memory Quadruple Extra Large instance. The patched OpenJDK JVM showed an average GC pause time of 28.9ms, while the JDK 6u26 average GC pause was 75.4ms on the same test (a 2.5 times reduction in GC pause duration).
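(As a side note, average pause numbers like these can be extracted from GC logs; the standard JDK 6/7 logging options below are one way to capture them and are listed here only for completeness.)

    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintGCApplicationStoppedTime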

Some history behind this patch

Before identifying the inefficiency in the card scanning code, I spent a lot of effort analyzing the cost structure of young GC pauses on large heaps. You can find more information about GC pause factors, the CMS collector and JVM tuning options in my other articles.
I started with a synthetic test application which fills up the whole heap with hash maps and strings. This approach allowed me to simulate a predictable and stable load on the GC at any desired heap size (unlike a synthetic test, a real application's impact on GC depends on heap size, which makes an apples-to-apples comparison of results across heap sizes impossible). The synthetic test application has two modes:
· normal mode - fills up the heap with objects and continues to replace objects,
· dry mode - fills up the heap with objects, then stops modifying data structures, producing only short-lived objects.
In dry mode the young GC has very little work to do, so comparing these two modes makes it possible to identify the cost segments of a young collection.
Below are diagrams showing results for different JVMs and heap sizes (including the patch mentioned above).
The serial collector shows a very clean picture: pauses in dry mode are proportional to heap size and the growth is quite steep. The patched JVM shows a huge reduction of pauses in dry mode (which proves that card table scanning is the dominant factor). The difference between normal and dry mode is almost the same in JDK 6u26 and the patched JVM, which is expected because, except for card table scanning, their code base is identical.
The results of the CMS collector fluctuate much more, which is expected from a multithreaded algorithm (and yes, the diagram shows average numbers from several runs).

We can clearly see that the CMS collector shows a similar benefit from the patch. Increasing ParNew's stride size also seems to have a positive effect on pause time for large heaps, though I didn't investigate it thoroughly and it may be caused by other factors.

Conclusion

In many cases an optimization like the one implemented in my patch could be considered bad practice: it makes the code base a little less clear. But my experiments clearly show that this piece of code is one of the major GC hotspots, and it is definitely worth some carefully crafted tuning, because it has a dramatic effect on GC pause time - one of the critical JVM performance metrics. I really hope to see this optimization implemented in the mainstream JDK soon. Combined with other tricks, it would allow me to keep the GC pauses of my applications under a 50ms envelope (on modern hardware), which I would consider a great achievement for Java in soft-real-time applications.

10 comments:

  1. Are you thinking of submitting this patch to OpenJDK?

  2. This comment has been removed by the author.

  3. The request for enhancement has been accepted. You can use the following link to track/vote:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068625

  4. This is really nice. It's the first time I've seen someone outside the Hotspot team patching the C++ code.

    BTW, that byte aligned fast copy - sounds like Duff's device

  5. Really useful and educating.

    Voted for it.

    Thanks, Alexey.

  6. I heard about this patch at the JVM Language Summit 2011. You could improve it with 128-bit scanning (SIMD instructions on x86) or even 256-bit scanning (AVX on Sandy Bridge or AMD clones).

  7. It might be worth comparing your changed CMS against G1

  8. There is not much point in comparing CMS with G1 (regardless of this patch).

    At the moment G1 cannot sustain the type of load I'm interested in (running Oracle Coherence with 30GiB per JVM).

    I will probably give it another try in half a year or so.

  9. The RFE is closed, btw, but only for Java 8 and hs24. I think version 24 of HotSpot is not in an Oracle product yet, right?
