Wednesday, October 12, 2011

Java GC, HotSpot's CMS and heap fragmentation

Concurrent Mark Sweep (CMS) is one of garbage collection algorithms implemented in HotSpot JVM. CMS is designed to be mostly concurrent, requiring just two quick stop-the-world pauses per old space garbage collection cycle. But if CMS cannot keep up with allocation of memory by application, it may fallback to stop-the-world whole heap collection - infamous Full GC pause. Full GC is very noticeable application pause, on large heaps it can take several dozens of seconds to collect and compact whole heap. 

There are two different failure scenarios for CMS collection (both of them are leading to longer than usual stop-the-world pauses):
  • concurrent mode failure,
  • promotion failure.

Concurrent mode failure

At beginning  of each young GC, collector should ensure that there is enough free memory in old space to promote aged objects from young space. Modern CMS collector estimates size of objects to be promoted using statistics from previous collections. If old space does not have enough free bytes to hold estimated promotion amount, concurrent mode failure will be raise. Concurrent mode failure doesn't necessary lead to Full GC, in certain cases JVM will just wait for concurrent collection cycle to finish, but application will remain in STW pause until young collection will be finished.

Most frequent reason for concurrent mode failure is late initiation of CMS cycle. JVM tries to estimate amount of garbage in heap and duration of CMS cycle and start it as late as possible to avoid wasting of CPU. Unfortunately this estimation may be too optimistic. You can advise JVM to start CMS earlier using following flags:
-XX:CMSInitiatingOccupancyFraction=30
-XX:+UseCMSInitiatingOccupancyOnly
Setting above will force CMS cycle is more than 30% of old space is use. Second option disables JVM heuristics, without second parameter JVM may not obey CMS initiating occupancy fraction setting.

Normally for server type applications you would like CMS to be running continuously. If you are experiencing concurrent mode failure, even though next CMS cycle is starting right after previous, it means that CMS throughput is just not enough. In this case you should increase size of old generation and give CMS collector more head room to do its job. Alternatively you may try to dedicate more CPU cores for concurrent collector, but CPU is usually even more limited resource on modern servers than memory.

In summary, there are two reasons for concurrent mode failure STW pause mentioned above, both of them can be remedied fairly easily with JVM options.

Promotion failure

Promotion failure is more complicated situation. CMS collector is not compacting free memory in old space, instead it have to deal with fragmented free space (a set of free memory chunks). It is possible, that all free bytes are scattered though small chunks, and it is impossible to find certain amount of continuous memory to promote particular large object, even though total number of free bytes is large enough.

Heap fragmentation is well known problem, and there are few effective techniques reducing fragmentation.

CMS memory manager is using separate free lists for different size of chunks. Using these free lists, it can effectively fill small holes in fragmented memory space with objects of exact size. This technique is known to be fairly effective (and widely used in C/C++ memory managers). But, surprisingly, it doesn't seems to work well for JVM in many real live situations.

What is wrong with a way JVM allocating memory in old space?

PLABs

Update: According to comment below, I'm be very wrong here. Please read comment from Ramki and give me some time to grok this staff and rewrite section.
Also take a look at next articale which have some numbers to back me up
CMS heap fragmentation. Follow up 1

Update 2: I have written another article specifically explaining PLABs in CMS. See Java GC, HotSpot's CMS promotion buffers

Section below describes old PLAB logic, which is still present in code base, but not used any more.
Let me make a step aside and explain some details of JVM memory management. You may know that JVM is using TLABs (thread local allocation block) to avoid concurrent allocation from single memory source over multiple threads. Same idea is also used for parallel young space collector. Collecting threads are copying live objects (young collection is a copy collection) and they need to allocate memory for objects being copied. To avoid CPU cache contention (due to synchronization of caches between CPUs) each thread have two, one for young space and one for old space, PLABs (parallel or promotion local allocation buffer) of its own. Using PLAB in young space is totally make sense, free memory is always continuous there, but using PLAB for old space in practice makes a huge contribution to fragmentation.

Let me illustrate this by example.
Each collector thread is allocating chunk of continuous memory in old space. Then, during collection, these chunks are filled with objects of different size (large objects will be allocated directly in old space, but small objects will be placed in PLAB).
Once young collection is finished, our PLABs are full with objects. Then, after some time, CMS cycle takes place and sweeper mark few dead objects as free space, creating small free memory chunks.
But on next young collection, instead of reusing small chunks of memory, each GC thread will allocate new large continuous block and will fill it with small objects.
Effectively, due to using PLABs, free lists are not used for small objects and small chunks will never be reused (until JVM could coalesce them into larger chunks). In practice, in our applications, we have a lot of small objects which can live long enough to die in old space. Strings and hash map entries are just to name few such cases.

Ironically, while CMS has sophisticated free list machinery, it makes it inefficient by using PLABs.

Important: There is bug in JDK prior 6u25, which makes fragmentation even worse. See CR-6999988 for more details.

Can we workaround this issue?
Yes, we can reduce size of PLAB improving reuse of small chunks.
-XX:OldPLABSize=16
-XX:-ResizeOldPLAB
Options above will force JVM to use PLAB as small as 16 heap words (heap word = 4 bytes in 32 bit JVM or 64 bit JVM with compressed pointers enabled, 8 bytes otherwise).

My experiments have shown drastic reduction of heap fragmentation in few application with known promotion failure problems. Though options above will help you reduce fragmentation, I would recommend to use them only if you HAVE problems with fragmentation. Using PLAB is important from multi core prospective and such soft disabling of PLAB may have serious impact on young collection pause time.

Remainder: Section above was based on my missunderstaning of PLABs. Correct description can be found here.
Next section is ok :) 

How to diagnose problems with fragmentation?

Monitoring heap fragmentation

Below are options, that will help you diagnose and monitor problem with CMS heap fragmentation.
-XX:+PrintGCDetails
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
PrintGCDetails is MUST HAVE base line GC diagnostic option. PrintPromotionFailure will print few more details about promotion failures (see example below).
592.079: [ParNew (0: promotion failure size = 2698)  (promotion failed): 135865K->134943K(138240K), 0.1433555 secs]
PrintFLSStatistics=1 will print free list statistics each young or old collection (see example below).
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 40115394
Max   Chunk Size: 38808526
Number of Blocks: 1360
Av.  Block  Size: 29496
Tree      Height: 22
Most important parameter of FLS (free list space) statistics is max chunk size. If size of max chunk decreasing over time, that means increasing fragmentation of heap and risk of promotion failure.

Conclusion                     

Let me summarize, how to reduce risk of concurrent mode and promotion failures using CMS collector.
1. Provide CMS enough head room, more free space – less risk of fragmentation. Remember any garbage collector need some extra space to work efficiently.
2. Let CMS run continuously using initiating occupancy settings (you may want reduce number of concurrent threads though).
3. If you still having problems – reduce old PLAB size.
4. Avoid spurious allocation of very large objects in application (resize of large hash table is a good example). Regular allocation of large objects is ok, though.

See also

Java GC, HotSpot's CMS promotion buffers 
CMS heap fragmentation. Follow up 1
HotSpot JVM garbage collection options cheat sheet
Understanding GC pauses in JVM, HotSpot's minor GC
Understanding GC pauses in JVM, HotSpot's CMS collector
How to tame java GC pauses? Surviving 16GiB heap and greater



15 comments:

  1. Hi Alexey -- Thanks for the blogs. Which JVM version were you using in your experiments? Are you aware of Sun CR 6999988 ? As far as I can tell, this is fixed in 6u25 / hs20 for the first time. The existence of that bug led to all of the "sophisticated machinery" introduced in 6631166 being ineffective.

    thanks.
    -- ramki

    ReplyDelete
  2. Hi Ramki,

    Very interesting note.
    I was using JDK 6u26 in may tests. I have repeated my test with 6u24. Indeed, it have shown much more frequent promotion failure events compared to 6u26.

    Regards,
    Alexey

    ReplyDelete
  3. Hi Alexey -- Thanks for confirming that the fragmentation is worse with use of "PLABs" in
    the old generation, even after the fix for 6999988.

    I'd like to point out though, one small inaccuracy in your description of the use of "PLABs" for the CMS old space when allocating small blocks. As you know, the small blocks come from per-size free lists indexed by size. This pool is, of course, populated, when empty by means of allocating a large block that is a multiple of the small block size list that we are trying to allocate out of and carving it into a number of smaller blocks. However, this populating of free lists is orthogonal to the PLAB allocation in the sense that some subset of these are then removed from the global pool and added to the per-thread cache. This cache of free blocks of various sizes constitutes the PLAB from which a specific thread will allocate. The number of blocks that will be fetched into this PLAB is dynamically adjusted based on perceived recent demand, suitably averaged over the set of GC worker threads, and low-pass filtered to avoid large fluctuations. Thus, in steady state conditions the number that are fetched into the local cache should closely track the number allocated in each scavenge. By setting OldPLABSize to 16 and turning off dynamic resizing of this parameter, one could lead to excess inventory lying around in the caches and the global free pool.

    One final note. OldPLABSize is interpreted in CMS as not the size of the PLAB in words, but rather the number of free blocks of a given size that are used when replenishing the local per-worker free list caches.

    I agree that any changes in these parameters should be done with the utmost care for the reasons you provide. We have a number of customers who saw improved scavenge times as well as improved fragmentation behaviour as a result of the changes in 6631166+6699988.

    Thanks again for your well thought out and well-written blogs!

    -- ramki

    ReplyDelete
  4. Ramki,

    Thank you, for clarification. I will update post, but want spent sometime to better understand CMS allocation buffers fisrt.

    In mean time, I have published results from my tests http://aragozin.blogspot.com/2011/10/cms-heap-fragmentation-follow-up-1.html

    ReplyDelete
  5. Hello Alexey -- Thanks for the blogs.
    I read this article further, I think it's great.

    ReplyDelete
  6. It was really nice to study your post. I collect some good points here. I would like to be appreciative you with the hard work you have made in skill this is great article.

    ReplyDelete
  7. Hi Alexey,
    I have gone through your blogs and they are an excellent source of rich information on GC patterns in Java. We are running into to some serious issues with the GCs in our production environment where we are seeing numerous, back to back promotion failures. We are using JDK 1.6.0_24 and below are the JVM settings. I am desperate for any help you can provide. Thanks in advance!

    - Vasanth

    -XX:PermSize=256m -XX:MaxPermSize=256m -XX:NewSize=512m -XX:MaxNewSize=512m -XX:SurvivorRatio=8 -XX:+DisableExplicitGC -Xms2560m -Xmx2560m -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:ParallelGCThreads=12 -Dweblogic.SocketReaders=12 -XX:-UseBiasedLocking -XX:LargePageSizeInBytes=128m -XX:-UseThreadPriorities -Dnetworkaddress.cache.ttl=180 -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1

    Excerpts from GC logs showing promotion failures:

    47966.117: [ParNew (9: promotion failure size = 4997124) (promotion failed): 455476K->455653K(471872K), 0.0703983 secs]47966.188: [CMSCMS: Large block 0x8af1b6f0
    : 1221978K->515117K(2097152K), 15.4494470 secs] 1667493K->515117K(2569024K), [CMS Perm : 177716K->174905K(262144K)]After GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 404984388
    Max Chunk Size: 404984388
    Number of Blocks: 1
    Av. Block Size: 404984388
    Tree Height: 1
    After GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 0
    Max Chunk Size: 0
    Number of Blocks: 0
    Tree Height: 0
    , 15.5235571 secs] [Times: user=17.57 sys=0.06, real=15.52 secs]

    48225.553: [ParNew (10: promotion failure size = 8083460) (promotion failed): 457342K->457341K(471872K), 0.0770736 secs]48225.630: [CMSCMS: Large block 0xd89239b8
    48226.011: [CMS-concurrent-sweep: 2.528/2.825 secs] [Times: user=7.61 sys=0.10, real=2.83 secs]
    (concurrent mode failure)CMS: Large block 0x8afc35d8
    : 1701999K->515789K(2097152K), 14.7651760 secs] 2143373K->515789K(2569024K), [CMS Perm : 175107K->175061K(262144K)]After GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 404812426
    Max Chunk Size: 404812426
    Number of Blocks: 1
    Av. Block Size: 404812426
    Tree Height: 1
    After GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 0
    Max Chunk Size: 0
    Number of Blocks: 0
    Tree Height: 0
    , 14.8443607 secs] [Times: user=15.40 sys=0.04, real=14.85 secs]

    ReplyDelete
    Replies
    1. You seems to be allocating pretty large objects (your log references 4MiB and 8MiB block allocation failure). Free lists are not covering such large objects.

      I have two guesses:
      1. You have very big HashMap or several (couple millions of keys) and failure is causes by hash table resize.
      2. You are allocating large arrays frequently.

      In first key, you should reserve enough capacity in HashMap on creation to avoid further resizes.
      In second case, only increasing heap size would probably help.

      Any way analyzing heap dump should help you understand your memory occupation.

      Delete
  8. Alexey, I am having the both errors on my application: concurrent mode failure & Promotion failed.

    Here is my log: 2013-08-27T16:59:53.952-0500: 201564.214: [GC 201564.214: [ParNew202182.528: [SoftReference, 0 refs, 0.0000110 secs]202182.528: [WeakReference, 1015 refs, 0.0001720 secs]202182.528: [FinalReference, 1625 refs, 0.0025210 secs]202182.531: [PhantomReference, 1 refs, 0.0000050 secs]202182.531: [JNI Weak Reference, 0.0000100 secs] (promotion failed): 3774912K->3774912K(3774912K), 619.3859950 secs]202183.600: [CMS CMS: abort preclean due to time 2013-08-27T17:10:14.937-0500: 202185.199: [CMS-concurrent-abortable-preclean: 7.271/626.671 secs] [Times: user=699.20 sys=98.95, real=626.55 secs]
    (concurrent mode failure)202189.179: [SoftReference, 614 refs, 0.0000710 secs]202189.179: [WeakReference, 5743 refs, 0.0007600 secs]202189.180: [FinalReference, 3380 refs, 0.0004430 secs]202189.180: [PhantomReference, 212 refs, 0.0000260 secs]202189.180: [JNI Weak Reference, 0.0000180 secs]: 8328471K->4813636K(8388608K), 18.5940310 secs] 11323577K->4813636K(12163520K), [CMS Perm : 208169K->208091K(346920K)], 637.9802050 secs] [Times: user=705.47 sys=98.79, real=637.85 secs]
    Total time for which application threads were stopped: 637.9820120 seconds

    Could you advise for the jvm parameter tuning?

    Here my server information:
    -Xmx12288m
    -Xms12288m
    -XX:MaxPermSize=1024m
    -XX:MaxNewSize=4096m
    -XX:InitialCodeCacheSize=128m
    -XX:ReservedCodeCacheSize=256m
    -XX:+UseCodeCacheFlushing
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled
    -XX:+CMSClassUnloadingEnabled
    -XX:+UseCodeCacheFlushing
    -XX:+OptimizeStringConcat
    -XX:+UseTLAB
    -XX:+ResizeTLAB
    -XX:TLABSize=512k
    -XX:+PrintGC
    -XX:+DisableExplicitGC
    -XX:+PrintGCDetails
    -XX:-PrintConcurrentLocks
    -XX:+PrintGCDateStamps
    -XX:+PrintGCTimeStamps
    -XX:+PrintReferenceGC
    -XX:+PrintJNIGCStalls
    -XX:+PrintGCApplicationStoppedTime
    -XX:+CMSScavengeBeforeRemark
    -XX:ConcGCThreads=12
    -XX:ParallelGCThreads=12

    I have 24 CPU Intel(R) Xeon(R) CPU X5650 @ 2.67GHz in my machine, RAM: 24675112 kB JDK version: java version "1.6.0_32"

    ReplyDelete
    Replies
    1. Hi,

      1. I would recommend you to upgrade JVM version, if I remember correctly there were bug in parallel processing of references in CMS prior 6u34

      2. Your case looks like fragmentation issue. So far only two remedies I know for this problem: restart JVM regulary and increase heap size (this would increase mean time to promotion failure)

      Delete
  9. Hi Alexey

    May I ask you which bug you are referring to? I have looked at http://www.oracle.com/technetwork/java/javase/documentation/overview-156328.html, but haven't found anything.

    Kind regards,
    Morten

    ReplyDelete
    Replies
    1. Bug is http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=2207828

      But it looks like I was wrong, it is reported to be fixed at 6u27

      Delete
  10. Hi Alexey,

    We are running into a strange issue:

    2013-09-12T10:08:02.253-0700: 120913.045: [GC Before GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 37142932
    Max Chunk Size: 19517862
    Number of Blocks: 20
    Av. Block Size: 1857146
    Tree Height: 8
    Before GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 3619112
    Max Chunk Size: 3617792
    Number of Blocks: 3
    Av. Block Size: 1206370
    Tree Height: 2
    120913.045: [ParNew120933.884: [SoftReference, 0 refs, 0.0000055 secs]120933.884: [WeakReference, 156 refs, 0.0000244 secs]120933.884: [FinalReference, 6 refs, 0.0000115 secs]120933.884: [PhantomReference, 61 refs, 0.0000094 secs]120933.884: [JNI Weak Reference, 0.0000013 secs]: 59008K->6528K(59008K), 20.8394227 secs] 895484K->846135K(1042048K)After GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 36263548
    Max Chunk Size: 19517862
    Number of Blocks: 19
    Av. Block Size: 1908607
    Tree Height: 8
    After GC:
    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 3619112
    Max Chunk Size: 3617792
    Number of Blocks: 3
    Av. Block Size: 1206370
    Tree Height: 2
    , 20.8399181 secs] [Times: user=0.70 sys=0.03, real=20.86 secs]

    [Times: user=0.70 sys=0.03, real=20.86 secs] seems to indicate a swapping failure. But we have a Windows 2003 server with 8GB RAM (with PAE) with committed bytes in 28 to 30% range.

    Could you please suggest what could be going wrong here.

    Regards,
    Avinash


    ReplyDelete
    Replies
    1. Hi,

      Looks strange indeed. What is CPU utilization on server during this pause? Is it possible what it starving on CPU.

      Another reason could be some safe point issue, I would recommend you to enable logging safe point statistics.

      Delete
  11. Hi,

    http://www.oracle.co...jsp-135488.html states clearly that G1 is a compacting collector.

    However, Are thee any tools or log options which would allow us to observe memory fragmentation, to ensure that G1 is indeed compacting memory?

    If there is one, how to use to observe memory fragmentation.

    Thanks in advance.

    ReplyDelete