Thursday, July 21, 2011

GC check list for data grid nodes

Recently I have written a series of articles about tuning HotSpot's CMS garbage collector for short pauses. In this article I would like to give a short check list for configuring JVM memory for a typical data grid storage node, addressing the most important aspects.

Enable concurrent mark sweep (CMS) and server mode

This article is about the HotSpot JVM concurrent mark sweep collector (CMS), so we have to turn it on:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Please also do not forget to enable server mode for the JVM. Server mode advises the JVM to use server-type defaults for various parameters. You will definitely need them for a typical data grid node, so it is better to add -server to your application command line.
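Putting these flags together, a minimal launch line might look like the following sketch (the heap size and the MyGridNode main class are illustrative placeholders, not from this article):

```shell
# Hypothetical launch line: server mode plus the CMS/ParNew collectors.
# MyGridNode and the 8 GiB heap are placeholders for your own values.
java -server \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -Xms8g -Xmx8g \
     MyGridNode
```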

Sizing of old space

In the JVM configuration you specify the total heap size (old + young generation), but when you are doing sizing exercises you should size them independently.

What should you consider when sizing old space? First, all your application data are stored in old space; you can consider young space to hold garbage only. Second, the CMS collector needs some headroom to work: the more headroom you provide, the more throughput you will get out of the CMS collector (this is also true for any other collector). Another important factor is fragmentation: more headroom means less chance of reaching a fatal level of fragmentation.
My empirical suggestions are:
  • at least 1GiB of headroom to keep fragmentation away,
  • headroom of 30%-70% of the live data set, depending on the application.
I would recommend starting with 100% headroom; then, if stress and endurance tests show no problems, you may try to reduce the headroom to save some memory on the box. If you still have problems even with 100% headroom, then something is probably wrong with your application and you need a more in-depth investigation. You can call me, BTW :)
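As a sketch of the arithmetic above (the 10 GiB live data set is an assumed figure, not from the article): with 100% headroom, old space comes out at twice the measured live set.

```shell
# Hypothetical sizing: assume a measured live data set of 10 GiB.
LIVE_GIB=10
HEADROOM_PCT=100   # start at 100%, tune down after stress/endurance tests
OLD_GIB=$(( LIVE_GIB + LIVE_GIB * HEADROOM_PCT / 100 ))
echo "Old space: ${OLD_GIB} GiB"
```

Reducing HEADROOM_PCT to, say, 50 after successful tests would shrink the old space requirement accordingly.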
 
How would you measure the space needed for data? You can use the jmap tool from the JDK and/or memory profilers to analyze the space requirements of your application. You could also make a forecast by analyzing your data structures, but in this case you should also account for the overhead imposed by the data grid. The following two articles provide some insight into memory overhead in the Oracle Coherence data grid.
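For instance, jmap can print a class histogram of the heap; the -histo:live form forces a full GC first, so the histogram reflects live objects only (the PID 12345 here is a placeholder):

```shell
# Print a class histogram of live objects for JVM process 12345 (placeholder PID).
jmap -histo:live 12345
```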

Sizing of young space

The size of your young generation will affect the frequency and, in some cases, the duration of young collection pauses in your application. So you need to find a balance between pause time and pause frequency. The tricky thing is that this balance depends on the ratio of long-lived to short-lived objects being generated, and this ratio is not constant. A typical data grid node usually has several workload types:
  • initial loading/refreshing of data - high ratio of long-lived objects in young space,
  • put/get read-mostly workload - low ratio of long-lived objects in young space,
  • server side processing workload - moderate to high ratio of long-lived objects in young space, with spikes.
Ideally we would like to use a different young space configuration for each of these modes, but that is impossible. We have to find a setup which can meet the SLA in each of these modes.

Making young space too big. This works well for the put/get read-mostly workload but produces longer pauses during initial loading (which is sometimes acceptable). For the server side workload your pauses may fluctuate with random spikes, especially if you are running heavy tasks on the server side.

Making young space too small. You will get more frequent young GC pauses, more work for the GC moving objects around young space, and more chance that short-lived objects end up in old space, making old GC harder (and contributing to fragmentation).
Use -XX:NewSize=<n> -XX:MaxNewSize=<n> to configure the young space size.
Also, once you have decided on both young and old space sizes, you can configure the total heap size with
-Xms<n> -Xmx<n>.
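For example, a 1 GiB young space on top of a 10 GiB old space (both numbers illustrative, not recommendations from the article) adds up to an 11 GiB total heap:

```shell
# Hypothetical sizing: 1 GiB young + 10 GiB old = 11 GiB total heap.
java -server \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:NewSize=1g -XX:MaxNewSize=1g \
     -Xms11g -Xmx11g \
     MyGridNode
```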

Young objects promotion strategy

Promotion strategy depends on the application workload. The most common strategies are:
  • always tenure - objects are promoted upon the first collection they survive,
  • second collection - objects are promoted on the second collection they survive.
Always tenure. This option works better if you have a large young space and long intervals between young collections. This way you avoid copying objects within young space, at the cost of letting a small percentage of short-lived objects leak into old space. It is good only if you have verified that this leak is really low.
Use -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=40960 to enable this strategy.

Second collection. This way we can guarantee that only seasoned objects ever get promoted. The price is that every object in the application is copied at least twice before it ends up in old space. This strategy may be required for the server side processing workload, which tends to produce spikes of temporary objects. The strategy can be enabled by -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=<n>. You should also choose a reasonable survivor space size.
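As a sketch, the two strategies differ only in these flags; the SurvivorRatio=8 for the second-collection case is an illustrative survivor sizing choice, not a value from the article:

```shell
# Always tenure: no survivor aging, survivor spaces squeezed to a minimum.
ALWAYS_TENURE="-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=40960"

# Second collection: one round of aging, with real survivor spaces.
# SurvivorRatio=8 (eden 8x each survivor space) is an illustrative choice.
SECOND_COLLECTION="-XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=8"
```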

Why not age objects for longer? Keeping objects in young space is expensive. Young space uses a stop-the-world copying collector, so the more live objects in young space, the longer GC pauses will be. For a data grid storage node we usually have a very large total heap cleaned by the CMS collector, so adding a little more garbage to old space should not drastically affect the old space live-to-garbage balance. Also, young space is probably fairly large and the periods between collections are long enough to let short-lived objects die off before the next collection.

See the articles below for more details on young collection mechanics and young space tuning.

Making CMS pauses deterministic

Normally CMS pauses are fairly short, but there are factors which may significantly increase them. Our goal is to avoid such occasional longer-than-usual pauses.

Initial marking pause. During this pause CMS has to gather all references to objects in the old generation. This includes references from thread stacks and from young space. Normally initial marking happens right after a young space collection, when the number of objects to scan in young space is very small. There is a JVM option limiting how long CMS will wait for the next young collection before it gives up and starts scanning young space anyway (i.e. it has to scan all objects, most of which are already dead). Such a pause can take dozens of times longer than usual. Setting the wait timeout long enough will help you prevent this situation.
-XX:CMSWaitDuration=<n>

Remark pause. The same is true for remark. Instead of scanning lots of dead objects in young space, it is faster to collect it and scan only the live ones. The remark phase cannot wait for a young collection, but it can force one. The following JVM option will force a young collection before remark.
-XX:+CMSScavengeBeforeRemark
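Taken together, the two options above might be added like this; the 10-minute wait is an illustrative value (CMSWaitDuration is specified in milliseconds), not a recommendation from the article:

```shell
# Hypothetical: wait up to 10 minutes (600000 ms) for a young collection
# before initial mark, and force a young collection before remark.
java -server \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSWaitDuration=600000 \
     -XX:+CMSScavengeBeforeRemark \
     MyGridNode
```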

Collecting permanent space

By default CMS will not collect permanent space at all. Sometimes this is OK, but sometimes it may cause a problem (e.g. if you are running JEE or OSGi). You can easily enable permanent space cleaning with the following JVM option.
-XX:+CMSClassUnloadingEnabled

Monitoring fragmentation (optional)

One potential problem with CMS is fragmentation of free space. The following option will let you monitor potential fragmentation and foresee possible problems (though it produces a large amount of logs, so you have to be ready to handle them).
-XX:PrintFLSStatistics=1
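Pulling the whole check list together, a complete starting point might look like the sketch below. All sizes and the survivor/tenuring choices are illustrative assumptions; the GC-log flags at the end are a common addition for monitoring, not something mandated by the article:

```shell
# Hypothetical consolidated setup for a data grid storage node.
java -server \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -Xms11g -Xmx11g \
     -XX:NewSize=1g -XX:MaxNewSize=1g \
     -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=8 \
     -XX:CMSWaitDuration=600000 \
     -XX:+CMSScavengeBeforeRemark \
     -XX:+CMSClassUnloadingEnabled \
     -XX:PrintFLSStatistics=1 \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     MyGridNode
```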

Additional links

12 comments:

  1. The best way is to use JRockit instead of HotSpot.

  2. It would be very interesting to know how you have handled the delayed young collection problem in the gencon collector of JRockit.
    See http://java.dzone.com/articles/jrockit-gc-action for details about this problem

  3. Alexey, thanks for the great articles. I have read all your HotSpot GC articles (on DZone and the GD blog), faced a real production problem with stop-the-world old generation cleaning on a Glassfish server (each collection took about a minute), and have now switched to CMS; we will see the results during high-workload hours...

    Could you advise a good profiler tool (with statistics on object counts and usage) for remote profiling? VisualVM gathers only memory statistics for remote servers, but not profiling information. AppDynamics tools didn't work in my application. JProfiler or JProbe? Though they are both paid.

    Thanks.

  4. Alexey, hi.

    Posted the comment here, but lost somewhere :(

    First of all, thanks (спасибо) for the perfect articles about memory management. I am struggling with memory problems in GF now; I used to use the default old generation GC mechanism, but have now switched to CMS.

    Several questions:
    1. What is headroom for CMS? Do you mean -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70?
    2. Without the mentioned parameters I receive one big Full GC an hour and a half after the system started (heap became about 80% full):
    3661.247: [Full GC (System) 3661.247: [CMS: 559389K->218849K(691200K), 37.9083820 secs] 716855K->218849K(898560K), [CMS Perm : 126590K->124486K(157652K)], 37.9172610 secs] [Times: user=4.01 sys=8.76, real=37.92 secs]
    I'm worried about the 37.92 - is this a real outage of the system, or is most of the work really done in parallel with the system's work?
    3. Should I set some configuration to make CMS GC more frequent and therefore shorter?

    Thanks, Alexey

  5. Some notes:
    I got the CMS process only after adding these parameters:
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:CMSInitiatingOccupancyFraction=50 (should I increase it?)

    The log I wrote above seems to have been a standard Full GC, initiated instead of CMS. UseCMSInitiatingOccupancyOnly is the key to success; now I get the cool CMS logs and no long pauses take place (I hope)

  6. Hi Anton,

    First, I need more logs to be more helpful.

    CMSInitiatingOccupancyFraction controls when CMS starts a collection. You have a relatively small heap - so I suspect that you are producing garbage too intensively.
    Probably you should increase the young generation to kill more objects there.
    Again, without full GC logs I can only speculate.

  7. Alexey, hi again.

    Have you experienced any problems with fragmentation of old memory due to CMS in practice? Any influence on young GC (due to slower copying of live objects to tenured space)?
    I use a JSF web application, and due to the specifics of this technology old space GCs are rather frequent (every 20-30 minutes, as there is lots of garbage there by that time), so it is cleaned rather often, but the cleaned space is not compacted. Can this have any impact on GC work in the future?
    Thanks

  8. Hi Anton,

    It is normal practice to run old CMS collections frequently. Fragmentation can affect the time of young collection, but I doubt it would be a problem.

    Do you experience young pauses increasing over time?

    Again, GC logs could be helpful ;)

  9. This is my headache. I changed the JSF configuration not to save too many views in the session, and I found a memory leak in Hibernate, but I still see problems. Today I saw that the old generation is not cleaned well by the end of the day, so some leaks take place. I think this explains the very poor work of CMS (
    54864.557: [CMS-concurrent-mark: 60.300/78.964 secs] [Times: user=13.90 sys=14.44, real=78.97 secs]

    and 55022.126: [CMS-concurrent-sweep: 93.369/131.584 secs] [Times: user=23.40 sys=16.22, real=131.60 secs] )

    But the big question for me: even at the start of the day (when memory seems OK), young GC has some extremes, e.g.:
    43936.918: [GC 43936.919: [DefNew: 186567K->2142K(207360K), 5.1254300 secs] 647029K->463091K(898560K), 5.1272270 secs] [Times: user=0.03 sys=0.02, real=5.13 secs]

    54269.802: [GC 54269.803: [DefNew: 185643K->2213K(207360K), 4.9494320 secs] 566589K->383247K(898560K), 4.9498890 secs] [Times: user=0.02 sys=0.00, real=4.96 secs]

    I uploaded gc log here:

    http://narod.ru/disk/25663119001/gc-110919%20044.log.html

    Thanks, didn't want to bother you, but would appreciate a lot if you find little time to look there :)

  10. Hi Anton,

    A few comments about your logs:
    - you are running single threaded CMS; is your hardware single core?
    - your full GCs are triggered by System.gc(). RMI has a habit of periodically calling System.gc(). This can be fixed by disabling explicit GC (or using concurrent explicit GC instead). See the options cheatsheet for details.
    - from your log:
    DefNew: ...[Times: user=0.03 sys=0.02, real=5.13 secs]
    DefNew: ...[Times: user=0.02 sys=0.03, real=10.69 secs]
    The JVM has spent just 0.05s of CPU time (including kernel) doing GC, but the wallclock pause time was a few seconds. One reason could be JNI; use -XX:+PrintJNIGCStalls to verify this. Also, -XX:+PrintCheckpointStatistics may reveal something.
    Another reason could be an extremely overloaded server. Are you sure that your application gets enough CPU resources?

  11. Do you think it is reasonable to use a large -XX:CMSWaitDuration like 300s? I sometimes have the problem, especially at less loaded times, that the preclean gets aborted, producing disastrous STW pauses (11 times in the following log):

    Replies
    1. 1. Yes, I think 300s is a reasonable setting for -XX:CMSWaitDuration.
      2. -XX:CMSWaitDuration affects the initial mark pause; your problem is probably remark pauses.
      Consider -XX:+CMSScavengeBeforeRemark
