Thursday, July 21, 2011

GC check list for data grid nodes

Recently I have written a series of articles about tuning HotSpot's CMS garbage collector for short pauses. In this article I would like to give a short check list for configuring JVM memory for a typical data grid storage node, covering the most important aspects.

Enable concurrent mark sweep (CMS) and server mode

This article is about the HotSpot JVM concurrent mark sweep collector (CMS), so first of all we have to turn it on:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Please also do not forget to enable server mode for the JVM. Server mode advises the JVM to use server-type defaults for various parameters; you will definitely need them for a typical data grid node, so add -server to your application command line.

Sizing of old space

In the JVM configuration you specify the total heap size (old + young generation), but when you are doing sizing exercises you should size them independently.

What should you consider when sizing old space? First, all your application data are stored in old space; you can consider young space to hold garbage only. Second, the CMS collector needs some headroom to work: the more headroom you provide, the more throughput you will get out of the CMS collector (this is also true for any other collector). Another important factor is fragmentation: more headroom means less chance of reaching a fatal level of fragmentation.
My empirical suggestions are:
  • at least 1GiB of headroom to keep fragmentation away,
  • headroom of 30%-70% of the live data set, depending on the application.
I would recommend you start with 100% headroom; then, if stress and endurance tests show no problems, you may try to reduce the headroom to save some memory on the box. If you still have problems even with 100% headroom, then something is probably wrong with your application and you need a more in-depth investigation. You can call me BTW :)
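The suggestion above is simple arithmetic; here is a minimal Java sketch, assuming the 1GiB floor and headroom fractions from this article (they are empirical starting points, not hard rules):

```java
public class OldSpaceSizing {

    static final long GIB = 1L << 30;

    // Old space = live data set + headroom, where headroom is at least
    // 1 GiB and otherwise a fraction of the live data set.
    static long recommendedOldSpace(long liveDataSetBytes, double headroomFraction) {
        long headroom = Math.max(GIB, Math.round(liveDataSetBytes * headroomFraction));
        return liveDataSetBytes + headroom;
    }

    public static void main(String[] args) {
        // 10 GiB of live data with 100% headroom -> 20 GiB of old space
        System.out.println(recommendedOldSpace(10 * GIB, 1.0) / GIB); // prints 20
        // 10 GiB of live data with 50% headroom -> 15 GiB of old space
        System.out.println(recommendedOldSpace(10 * GIB, 0.5) / GIB); // prints 15
    }
}
```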
 
How do you measure the space needed for data? You can use the jmap tool from the JDK and/or memory profilers to analyze the space requirements of your application. You could also make a forecast by analyzing your data structures, but in this case you should also account for the overhead implied by the data grid. The following two articles provide some insight into memory overhead in the Oracle Coherence data grid.

Sizing of young space

The size of your young generation will affect the frequency and, in some cases, the duration of young collection pauses in your application, so you need to find a balance between pause time and pause frequency. The tricky thing is that this balance depends on the ratio of long lived to short lived objects being created, and this ratio is not constant. A typical data grid node usually has several workload types:
  • initial loading/refreshing of data - high ratio of long lived objects in young space,
  • put/get read-mostly workload - low ratio of long lived objects in young space,
  • server side processing workload - moderate to high ratio of long lived objects in young space, with spikes.
Ideally we would like to use a different young space configuration for each of these modes, but that is impossible. We have to find a setup which can meet the SLA in each of these modes.

Making young space too big. This works well for the put/get read-mostly workload but will produce longer pauses during initial loading (which is sometimes acceptable). For the server side workload your pauses may fluctuate with random spikes, especially if you are running heavy tasks on the server side.

Making young space too small. You will get more frequent young GC pauses, more work for the GC moving objects around young space, and more chances that short lived objects will get into old space, making old GC harder (and contributing to fragmentation).
Use -XX:NewSize=<n> -XX:MaxNewSize=<n> to configure young space size.
Also, once you have decided on both young and old space sizes, you can configure the total heap size with
-Xms<n> -Xmx<n>.
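The frequency half of this balance is easy to estimate: a young collection fires roughly every time eden fills up, so the interval between collections is about eden size divided by allocation rate. A minimal sketch (the allocation rate below is a hypothetical figure for illustration, not a measurement):

```java
public class YoungGcFrequency {

    // A young collection is triggered roughly each time eden fills up,
    // so the interval between collections is eden size / allocation rate.
    static double secondsBetweenYoungGc(double edenBytes, double allocationBytesPerSec) {
        return edenBytes / allocationBytesPerSec;
    }

    public static void main(String[] args) {
        double gib = 1L << 30;
        double mib = 1L << 20;
        // e.g. 1 GiB eden at a 100 MiB/s allocation rate -> ~10 s between young GCs
        System.out.println(secondsBetweenYoungGc(gib, 100 * mib)); // prints 10.24
    }
}
```

Doubling eden halves the pause frequency, but (for the reasons above) tends to make each pause longer.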

Young objects promotion strategy

The promotion strategy depends on the application workload. The most common strategies are:
  • always tenure - objects are promoted upon the first collection,
  • second collection - objects are promoted on the second collection they have survived.
Always tenure. This option works better if you have a large young space and long intervals between young collections. This way you avoid copying objects within young space, at the cost of letting a small percentage of short lived objects leak into old space. It is a good choice only if you have verified that this leak is really low.
Use -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=40960 to enable this strategy.

Second collection. This way we can guarantee that only seasoned objects will ever get promoted. The price is that every object in the application will be copied at least twice before it ends up in old space. This strategy may be required for the server side processing workload, which tends to produce spikes of temporary objects. The strategy can be enabled with -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=<n>. You should also choose a reasonable survivor space size.
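When choosing that survivor size, remember how HotSpot derives it from -XX:SurvivorRatio: each survivor space is young size / (ratio + 2), and eden takes the rest. A minimal sketch of the arithmetic:

```java
public class SurvivorSizing {

    static final long MIB = 1L << 20;

    // Each survivor space is newSize / (survivorRatio + 2).
    static long survivorSize(long newSize, int survivorRatio) {
        return newSize / (survivorRatio + 2);
    }

    // Eden takes whatever is left after the two survivor spaces.
    static long edenSize(long newSize, int survivorRatio) {
        return newSize - 2 * survivorSize(newSize, survivorRatio);
    }

    public static void main(String[] args) {
        // -XX:NewSize=64m -XX:SurvivorRatio=6 -> 8 MiB survivors, 48 MiB eden
        System.out.println(survivorSize(64 * MIB, 6) / MIB); // prints 8
        System.out.println(edenSize(64 * MIB, 6) / MIB);     // prints 48
    }
}
```

This also explains the huge SurvivorRatio=40960 in the always-tenure setup above: it simply shrinks the survivor spaces to near zero.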

Why not age objects for longer? Keeping objects in young space is expensive. Young space uses a stop-the-world copying collector, so the more live objects there are in young space, the longer GC pauses will be. For a data grid storage node we usually have a very large total heap which is cleaned by the CMS collector, so adding a little more garbage to old space should not drastically affect the old space live-to-garbage balance. Young space is probably also fairly large, and the periods between collections are long enough to let short lived objects die off before the next collection.

See articles below for more details on young collection mechanics and young space tuning.

Making CMS pauses deterministic

Normally CMS pauses are fairly short, but there are factors which may significantly increase them. Our goal is to avoid such occasional longer-than-usual pauses.

Initial marking pause. During this pause CMS has to gather all references to objects in the old generation. This includes references from thread stacks and from young space. Normally initial marking happens right after a young space collection, so the number of objects to scan in young space is very small. There is a JVM option limiting how long CMS will wait for the next young collection before it gives up and starts scanning young space anyway (i.e. it has to scan all objects, most of which are already dead). Such a pause can take a dozen times longer than usual. Setting the wait timeout long enough will help you prevent this situation.
-XX:CMSWaitDuration=<n>

Remark pause. The same is true for remark: instead of scanning lots of dead objects in young space, it is faster to collect it and scan only the live objects. The remark phase cannot wait for a young collection, but it can force one. The following JVM option forces a young collection before remark.
-XX:+CMSScavengeBeforeRemark

Collecting permanent space

By default CMS will not collect permanent space at all. Sometimes this is ok, but sometimes it may cause a problem (e.g. if you are running JEE or OSGi). You can easily enable permanent space cleaning with the following JVM option.
-XX:+CMSClassUnloadingEnabled

Monitoring fragmentation (optional)

One potential problem with CMS is fragmentation of free space. The following option will let you monitor potential fragmentation to foresee possible problems (though it produces a fair amount of log output, so you have to be ready to handle it).
-XX:PrintFLSStatistics=1

Additional links

Thursday, July 14, 2011

Coherence SIG: Using Oracle Coherence to Enable Database Partitioning and DC Level Fault Tolerance

Today I was speaking at Coherence SIG in New York.
Update: on 21 July I was speaking at the BA Coherence SIG; the slide deck has been slightly updated.
Partitioning is a very powerful technique for scaling database centric applications. One tricky part of a partitioned architecture is routing requests to the right database. The routing layer (routing table) should know the right database instance for each attribute which may be used for routing (e.g. account id, login, email, etc): it should be fast, it should be fault tolerant and it should scale. All the above makes Oracle Coherence a natural choice for implementing such routing tables in partitioned architectures. This presentation covers synchronization of the grid with multiple databases, conflict resolution, cross cluster replication and other aspects of implementing a robust partitioned architecture.

Tuesday, July 12, 2011

JRockit GC in action

An article about JRockit's garbage collectors and the experience of using it with large heap sizes.

In this article I would like to elaborate on the garbage collection specifics of Oracle's JRockit JVM. Recently JRockit has been made free for use, and many people may consider using it instead of the other widely popular Oracle JVM - HotSpot (formerly Sun's JVM).
JRockit uses mark-sweep-compact (MSC) as its base garbage collection algorithm, though it allows a lot of tweaking. The JVM command line option -Xgc: lets you choose variations of the MSC algorithm.
Full text of article is available at JavaLobby http://java.dzone.com/articles/jrockit-gc-action.

Wednesday, July 6, 2011

OpenJDK patch cutting down GC pause duration up to 8 times

The patch described in this article (RFE-7068625) has been included in mainstream OpenJDK.
It is available since Oracle's Java 7 update 40.

I have spent a good deal of time tuning GC on various JVMs during my career. Recently I was faced with a new challenge - running a JVM with 32GiB of heap space. The first few tests showed that the factors affecting GC pause time on a 32GiB heap are very different from those on, e.g., a 1GiB heap. That was the beginning of this story.

Patch

The patch itself is fairly trivial, unlike the amount of research done to find this opportunity for improvement. I do not want to bore you, so I am putting the description of the patch first and the whole story after it.

Effect of a patch

The patch can considerably reduce pause times for the CMS (young collection and remark pauses) and serial (young collection pauses) collectors on large heaps (8GiB and greater). The exact improvement depends on the application, but if your application is a typical server processing a load of requests, you can expect something like a 1.5-4 times cut in your GC pauses with the CMS collector on the x86-amd64 architecture.

How does it work?

There are many cost factors affecting the time of a young collection, but here I will focus on 3 of them (which are usually dominant):
  1. effort to scan the dirty card table (proportional to heap size),
  2. effort to scan references in dirty cards (application specific, but asymptotically constant as heap size grows),
  3. effort to relocate surviving objects in young space (roughly proportional to the size of survived objects).
You can find a much more detailed explanation of young collection in one of the articles listed below.
These 3 factors dominate GC pause time. Factors 2 and 3 are application specific, but they do not depend much on total JVM heap size (because they are related to young objects). Both factors 2 and 3 can be reduced by more frequent young collections (by shrinking young space). Factor 3 also depends on tenuring policy (you can read more here).
Factor 1 - the effort to scan the dirty card table - is the most interesting. It is application neutral, and it is proportional to the size of old space (heap size - young space + permanent space). So eventually, as heap size grows, it becomes the dominant factor in GC pause time.
The HotSpot JVM (OpenJDK) uses 512 byte cards for its card marking write barrier. It means that for a 32GiB heap the garbage collector has to scan roughly 64MiB of card table. How long might it take to scan 64MiB? The memory subsystem of a modern server can stream this number of bytes within 1-3ms, but experiments show that the JVM spends dozens of milliseconds scanning the card table. This led me to suspect that the JVM code was doing something suboptimal here (also, the competing JRockit JVM can do young collections much faster, which was another reason to look for a problem in the OpenJDK code base).
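The card table arithmetic above is easy to double check (one card-table byte per 512-byte card of heap):

```java
public class CardTableMath {

    static final long GIB = 1L << 30;
    static final long MIB = 1L << 20;
    static final int CARD_SIZE = 512; // HotSpot card size in bytes

    // One byte of card table covers one 512-byte card of heap.
    static long cardTableSize(long heapBytes) {
        return heapBytes / CARD_SIZE;
    }

    public static void main(String[] args) {
        // 32 GiB heap -> 64 MiB card table
        System.out.println(cardTableSize(32 * GIB) / MIB); // prints 64
    }
}
```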

Code

After browsing the GC code base in OpenJDK, I identified a suspicious place in cardTableRS.cpp. The class ClearNoncleanCardWrapper is used in both serial and CMS collectors to find dirty memory regions using the card table. Below is the original code of the do_MemRegion method:
Scanning memory byte by byte raised a red flag for me. Unaligned byte operations are fairly expensive for a modern CPU, plus it involves branching for each byte. With a large heap the majority of cards will be clean, so we can expect a serious performance boost from adding an optimized code path which skips continuous ranges of clean cards. Below is my implementation of the same method.
 

Please keep in mind that my version is just a proof-of-concept. I have written this code just to prove the importance of such an optimization. It is not an example of good coding and can be optimized further in terms of performance. In the implementation above, an additional code branch is added which can skip continuous ranges of clean cards in a very tight loop, 8 cards per cycle (thus only memory speed should limit it on modern architectures).
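The patch itself is C++ inside cardTableRS.cpp, but the idea of the fast path can be illustrated with a short Java sketch: treat the card table as a byte array and, whenever a single word-sized read shows that the next 8 cards are all clean, skip all 8 in one step. The encoding below (clean card = 0) is chosen so that a clean run reads as a zero word; it is illustrative and does not match HotSpot's actual card values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class CardScanSketch {

    static final byte CLEAN = 0; // illustrative encoding, not HotSpot's
    static final byte DIRTY = 1;

    // Returns indices of dirty cards, skipping runs of clean cards
    // 8 at a time via a single 64-bit read.
    static List<Integer> dirtyCards(byte[] cards) {
        ByteBuffer buf = ByteBuffer.wrap(cards).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> dirty = new ArrayList<>();
        int i = 0;
        while (i < cards.length) {
            // fast path: one long comparison tells us that 8 cards are all clean
            if (i + 8 <= cards.length && buf.getLong(i) == 0L) {
                i += 8;
                continue;
            }
            // slow path: inspect cards one by one
            if (cards[i] != CLEAN) {
                dirty.add(i);
            }
            i++;
        }
        return dirty;
    }

    public static void main(String[] args) {
        byte[] cards = new byte[128];
        cards[3] = DIRTY;
        cards[70] = DIRTY;
        System.out.println(dirtyCards(cards)); // prints [3, 70]
    }
}
```

The slow path still handles the neighborhood of each dirty card and the tail of the table, so the result is identical to a byte-by-byte scan; only mostly-clean tables get faster.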
First tests with the serial collector showed the desired effect: e.g. on a 24GiB heap the scan time was reduced from 64ms to 8.6ms (almost an 8 times improvement).

A bit more work for CMS collector

The parallel collector (CMS's ParNew collector) did not show any positive effect from the patch at first (though it uses the same code base). It turns out that the ParNew collector manages work in very fine grained strides which are distributed between threads. The default stride size is so small that the effect of the fast loop is negligible. Fortunately the stride size can be configured; adding the following options made the results of the patched JVM much more spectacular.
-XX:+UnlockDiagnosticVMOptions
-XX:ParGCCardsPerStrideChunk=4096

For the final judgment of the patch I used an Oracle Coherence data grid node with 28GiB of heap and about 15M objects in the cache. The JVM was configured to use the CMS collector (with the parallel young space collector). Tests were performed on an Amazon EC2 high memory quadruple extra large instance. The patched OpenJDK JVM showed an average GC pause time of 28.9ms, while the JDK 6u26 average GC pause was 75.4ms in the same test (a 2.6 times reduction in GC pause duration).

Some history behind this patch

Before identifying the inefficiency in the card scanning code, I spent a lot of effort analyzing the cost structure of young GC pauses on a large heap. You can find more information about GC pause factors, the CMS collector and JVM tuning options in my other articles:
I started with a synthetic test application which fills up the whole heap with hash maps and strings. This approach allowed me to simulate a predictable and stable load on the GC with any desired heap size (unlike a synthetic test, a real application's impact on GC depends on heap size, which makes it impossible to do an apples to apples comparison of results from different heap sizes). The synthetic test application has two modes:
  • normal mode - fills up the heap with objects and continues to replace objects,
  • dry mode - fills up the heap with objects, then stops modifying data structures, producing only short lived objects.
In dry mode the young GC has very little work to do, so comparing these two modes lets us identify the cost segments of a young collection.
Below are diagrams showing results for different JVMs and heap sizes (including the patch mentioned above).
The serial collector shows a very clean picture: pauses in dry mode are proportional to heap size, and the growth is quite steep. The patched JVM shows a huge reduction of pauses in dry mode (which proves that card table scanning is the dominant factor). The difference between normal and dry mode in JDK 6u26 and the patched JVM is almost the same, which is expected because, except for card table scanning, their code bases are identical.
The results of the CMS collector fluctuate much more, which is expected from a multithreaded algorithm (and yes, the diagram shows average numbers from several runs).

We can clearly see that the CMS collector shows a similar benefit from the patch. Increasing ParNew's stride size also seems to have a positive effect on pause time for large heaps, though I did not investigate it thoroughly and it may be caused by other factors.

Conclusion

In many cases an optimization similar to the one implemented in my patch could be considered bad practice: it makes the code base a little less clear. But my experiments clearly show that this piece of code is one of the major GC hotspots, and it is definitely worth some carefully crafted code tuning, because it has a dramatic effect on GC pause time - one of the critical JVM performance metrics. I really hope to see this optimization implemented in mainstream JDK soon. Combined with other tricks, it would allow me to put the GC pauses of my applications under a 50ms envelope (on modern hardware), which I would consider a great achievement for Java in soft real time applications.

HotSpot JVM garbage collection options cheat sheet

In this article I have collected a list of options related to GC tuning in the JVM. This is not a comprehensive list; I have only included options which I use in practice (or at least understand why I may want to use them).

HotSpot GC collectors

HotSpot JVM may use one of the six combinations of garbage collection algorithms listed below.

Young collector                 | Old collector                            | JVM options
Serial (DefNew)                 | Serial Mark-Sweep-Compact                | -XX:+UseSerialGC
Parallel scavenge (PSYoungGen)  | Serial Mark-Sweep-Compact (PSOldGen)     | -XX:+UseParallelGC
Parallel scavenge (PSYoungGen)  | Parallel Mark-Sweep-Compact (ParOldGen)  | -XX:+UseParallelOldGC
Serial (DefNew)                 | Concurrent Mark Sweep                    | -XX:+UseConcMarkSweepGC -XX:-UseParNewGC
Parallel (ParNew)               | Concurrent Mark Sweep                    | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
G1                              | G1                                       | -XX:+UseG1GC

GC logging options

JVM option                             | Description
General options
-verbose:gc or -XX:+PrintGC            | Print basic GC info
-XX:+PrintGCDetails                    | Print more elaborate GC info
-XX:+PrintGCTimeStamps                 | Print timestamps for each GC event (seconds counted from start of JVM)
-Xloggc:<file>                         | Redirect GC output to a file instead of the console
-XX:+PrintTenuringDistribution         | Print detailed demography of young space after each collection
-XX:+PrintTLAB                         | Print TLAB allocation statistics
-XX:+PrintGCApplicationStoppedTime     | Print pause summary after each stop-the-world pause
-XX:+PrintGCApplicationConcurrentTime  | Print time for each concurrent phase of GC
-XX:+HeapDumpAfterFullGC               | Create a heap dump file after full GC
-XX:+HeapDumpBeforeFullGC              | Create a heap dump file before full GC
-XX:+HeapDumpOnOutOfMemoryError        | Create a heap dump on out-of-memory condition
-XX:HeapDumpPath=<path>                | Path to save heap dumps
CMS specific options
-XX:PrintCMSStatistics=<n>             | Print additional CMS statistics if n >= 1
-XX:+PrintCMSInitiationStatistics      | Print CMS initiation details
-XX:PrintFLSStatistics=2               | Print additional info concerning free lists
-XX:PrintFLSCensus=2                   | Print additional info concerning free lists
-XX:+CMSDumpAtPromotionFailure         | Dump useful information about the state of the CMS old generation upon a promotion failure
-XX:+CMSPrintChunksInDump              | In the CMS dump enabled by the option above, include more detailed information about free chunks
-XX:+CMSPrintObjectsInDump             | In the CMS dump enabled by the option above, include more detailed information about allocated objects

JVM sizing options

JVM option                                         | Description
-Xms<size> -Xmx<size> or
-XX:InitialHeapSize=<size> -XX:MaxHeapSize=<size>  | Initial and max size of the heap (young space + tenured space). Permanent space does not count towards this size.
-XX:NewSize=<size> -XX:MaxNewSize=<size>           | Initial and max size of young space.
-XX:NewRatio=<ratio>                               | Alternative way to specify young space size. Sets the ratio of young vs tenured space (e.g. -XX:NewRatio=2 means young space will be two times smaller than tenured space).
-XX:SurvivorRatio=<ratio>                          | Sets the size of a single survivor space as a portion of eden size (e.g. -XX:NewSize=64m -XX:SurvivorRatio=6 means each survivor space will be 8m and eden will be 48m).
-XX:PermSize=<size> -XX:MaxPermSize=<size>         | Initial and max size of permanent space.
-Xss<size> or -XX:ThreadStackSize=<size>           | Size of the stack area dedicated to each thread. Thread stacks do not count towards heap size.
-XX:MaxDirectMemorySize=<size>                     | Maximum size of off-heap memory available to the JVM.

Young collection tuning

JVM option                         | Description
-XX:InitialTenuringThreshold=<n>   | Initial value of the tenuring threshold (number of collections an object survives before being promoted to tenured space).
-XX:MaxTenuringThreshold=<n>       | Max value of the tenuring threshold.
-XX:PretenureSizeThreshold=<size>  | Max object size allowed to be allocated in young space (larger objects are allocated directly in old space). Thread local allocation bypasses this check, so if the TLAB is large enough an object exceeding this threshold may still be allocated in young space.
-XX:+AlwaysTenure                  | Promote all objects surviving a young collection immediately to tenured space (equivalent to -XX:MaxTenuringThreshold=0).
-XX:+NeverTenure                   | Objects from young space will never be promoted to tenured space while survivor space is large enough to keep them.
Thread local allocation blocks
-XX:+UseTLAB                       | Use thread local allocation blocks in young space. Enabled by default.
-XX:+ResizeTLAB                    | Allow the JVM to adaptively resize TLABs.
-XX:TLABSize=<size>                | Initial TLAB size per thread.
-XX:MinTLABSize=<size>             | Minimal allowed TLAB size.

CMS tuning options

JVM option                                    | Description
Controlling initial mark phase
-XX:+UseCMSInitiatingOccupancyOnly            | Only use occupancy as a criterion for starting a CMS collection.
-XX:CMSInitiatingOccupancyFraction=<n>        | Percentage of CMS generation occupancy at which to start a CMS collection cycle. A negative value means that CMSTriggerRatio is used.
-XX:CMSBootstrapOccupancy=<n>                 | Percentage of CMS generation occupancy at which to initiate a CMS collection for bootstrapping collection stats.
-XX:CMSTriggerRatio=<n>                       | Percentage of MinHeapFreeRatio in the CMS generation that is allocated before a CMS collection cycle commences.
-XX:CMSTriggerPermRatio=<n>                   | Percentage of MinHeapFreeRatio in the CMS perm generation that is allocated before a CMS collection cycle (one that also collects the perm generation) commences.
-XX:CMSWaitDuration=<timeout>                 | Once a CMS collection is triggered, CMS will wait for the next young collection in order to perform initial mark right after it. This parameter limits how long CMS may wait for a young collection.
Controlling remark phase
-XX:+CMSScavengeBeforeRemark                  | Force a young collection before the remark phase.
-XX:CMSScheduleRemarkEdenSizeThreshold=<size> | If eden usage is below this value, don't try to schedule the remark pause.
-XX:CMSScheduleRemarkEdenPenetration=<n>      | Eden occupancy percentage at which to try and schedule the remark pause.
-XX:CMSScheduleRemarkSamplingRatio=<n>        | Start sampling eden top at least before young generation occupancy reaches 1/n of the size at which we plan to schedule remark.
Parallel execution
-XX:+UseParNewGC                              | Use the parallel algorithm for young space collection.
-XX:+CMSConcurrentMTEnabled                   | Use multiple threads for concurrent phases.
-XX:ConcGCThreads=<n>                         | Number of threads used for concurrent phases.
-XX:ParallelGCThreads=<n>                     | Number of threads used for stop-the-world phases.
CMS incremental mode
-XX:+CMSIncrementalMode                       | Enable incremental CMS mode. Incremental mode is meant for servers with a small number of CPUs.
Miscellaneous options
-XX:+CMSClassUnloadingEnabled                 | If not enabled, CMS will not clean permanent space. You should always enable it in multiple class loader environments such as JEE or OSGi.
-XX:+ExplicitGCInvokesConcurrent              | Let System.gc() trigger a concurrent collection instead of a full GC.
-XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses | Same as above, but also triggers permanent space collection.

Miscellaneous GC options

JVM option              | Description
-XX:+DisableExplicitGC  | JVM will ignore application calls to System.gc().