Comments on Alexey Ragozin: Java GC, HotSpot's CMS and heap fragmentation

Anonymous (2015-10-06):

Hi Alexey,

One application runs at 100% CPU. We have the following JVM parameters:

-server
-d64
-Xms2048M
-Xmx2048M
-XX:+DisableExplicitGC
-XX:NewSize=512M
-XX:MaxNewSize=512M
-XX:SurvivorRatio=4
-XX:PermSize=256M
-XX:MaxPermSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=65
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSPermGenSweepingEnabled
-XX:MaxTenuringThreshold=30

The thread dump shows:

Heap
 par new generation total 436928K, used 93221K [0x00000006b0000000, 0x00000006d0000000, 0x00000006d0000000)
  eden space 349568K, 21% used [0x00000006b0000000, 0x00000006b484f438, 0x00000006c5560000)
  from space 87360K, 21% used [0x00000006c5560000, 0x00000006c681a020, 0x00000006caab0000)
  to space 87360K, 0% used [0x00000006caab0000, 0x00000006caab0000, 0x00000006d0000000)
 concurrent mark-sweep generation total 4718592K, used 3590048K [0x00000006d0000000, 0x00000007f0000000, 0x00000007f0000000)
 concurrent-mark-sweep perm gen total 262144K, used 217453K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000)

 CMS: abort preclean due to time 2015-09-24T14:16:14.752+0200: 4505865.908: [CMS-concurrent-abortable-preclean: 4.332/5.134 secs] [Times: user=5.22 sys=0.08, real=5.14 secs]
2015-09-24T14:16:14.756+0200: 4505865.912: [GC[YG occupancy: 127725 K (436928 K)]4505865.912: [Rescan (parallel) , 0.0602290 secs]4505865.973: [weak refs processing, 0.0000220 secs] [1 CMS-remark: 3590048K(4718592K)] 3717774K(5155520K), 0.0604150 secs] [Times: user=0.64 sys=0.00, real=0.06 secs]

Can you help me understand this issue?
Alexey Ragozin (2015-03-12 06:08):

BTW, -Xmx should be the same as -Xms. That could be one source of problems.

Alexey Ragozin (2015-03-12 06:04):

The JVM options sound reasonable in general, but that doesn't mean they are reasonable for your application.
I would suggest posting this question on StackOverflow. Please also add a description of the application and the GC logs between two full GCs.
Anonymous (2015-03-11):

Hi Alexey,

We have the following options set:

-Xmx50g -server -XX:NewSize=4g -XX:MaxNewSize=4g -Xms40g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs -XX:+CMSScavengeBeforeRemark -XX:+CMSConcurrentMTEnabled -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:PrintFLSStatistics=1

We see CMS run into the heap fragmentation issue a few times a day, and during those episodes CMS takes about 40 seconds to do a GC. Are we doing something wrong here?
Srujan kumar Gulla (2014-10-17):

Thanks a lot for this blog, Alexey.

I see a few million lines in our log file, written by the JVM, of the form "live object of size n". Below are the flags enabled. Do you know which flag results in printing millions of these "live object" lines? Thanks.

/usr/java/bin/java -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xincgc
-XX:NewSize=256m -XX:MaxNewSize=256m
-XX:-UseBiasedLocking -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:PrintCMSStatistics=1
-XX:+PrintTenuringDistribution
-XX:+PrintReferenceGC
-XX:+PrintJNIGCStalls
-XX:+PrintCMSInitiationStatistics
-XX:+CMSDumpAtPromotionFailure
-XX:+PrintPromotionFailure

Logs:

0x000000077f200278: live object of size 5
0x000000077f2002a0: live object of size 4

Alexey Ragozin (2014-08-19):

I mean your Xmx should be large enough to accommodate both the objects and a work buffer for CMS.

-XX:CMSInitiatingOccupancyFraction controls when an old-generation GC cycle starts. But if the old space is full of live objects, the cycle will not free much.

Anonymous (2014-08-16):

What do you mean by "provide enough buffer room to CMS"? Are you saying that a lower fraction gives CMS enough buffer room, i.e. -XX:CMSInitiatingOccupancyFraction=30?
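The effect of the fraction can be made concrete with a little arithmetic. The numbers below are hypothetical, not taken from this thread; the point is only that the flag is a percentage of old-generation capacity at which the concurrent cycle is triggered:

```java
public class CmsThreshold {
    // Old-generation occupancy (MiB) at which a CMS background cycle is
    // triggered for a given -XX:CMSInitiatingOccupancyFraction value
    // (assuming -XX:+UseCMSInitiatingOccupancyOnly, so ergonomics do not
    // override the threshold).
    static long initiatingOccupancyMiB(long oldGenMiB, int fraction) {
        return oldGenMiB * fraction / 100;
    }

    public static void main(String[] args) {
        long oldGen = 4608;  // hypothetical 4.5 GiB old generation
        // fraction 65: the cycle starts once ~2995 MiB are occupied
        System.out.println(initiatingOccupancyMiB(oldGen, 65));
        // fraction 30: the cycle starts at ~1382 MiB, leaving far more
        // headroom for promotions while the concurrent cycle runs
        System.out.println(initiatingOccupancyMiB(oldGen, 30));
    }
}
```

Lowering the fraction trades CPU (more frequent concurrent cycles) for headroom against promotion failure.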
Anonymous (2014-03-07):

Hi,

http://www.oracle.co...jsp-135488.html states clearly that G1 is a compacting collector.

However, are there any tools or log options that would allow us to observe memory fragmentation, to verify that G1 is indeed compacting memory? If there is one, how do we use it to observe fragmentation?

Thanks in advance.
Alexey Ragozin (2013-09-13):

Hi,

That looks strange indeed. What is the CPU utilization on the server during this pause? Is it possible that it is starving for CPU?

Another reason could be a safepoint issue; I would recommend enabling logging of safepoint statistics.
Anonymous (2013-09-12):

Hi Alexey,

We are running into a strange issue:

2013-09-12T10:08:02.253-0700: 120913.045: [GC Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 37142932
Max Chunk Size: 19517862
Number of Blocks: 20
Av. Block Size: 1857146
Tree Height: 8
Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 3619112
Max Chunk Size: 3617792
Number of Blocks: 3
Av. Block Size: 1206370
Tree Height: 2
120913.045: [ParNew120933.884: [SoftReference, 0 refs, 0.0000055 secs]120933.884: [WeakReference, 156 refs, 0.0000244 secs]120933.884: [FinalReference, 6 refs, 0.0000115 secs]120933.884: [PhantomReference, 61 refs, 0.0000094 secs]120933.884: [JNI Weak Reference, 0.0000013 secs]: 59008K->6528K(59008K), 20.8394227 secs] 895484K->846135K(1042048K)After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 36263548
Max Chunk Size: 19517862
Number of Blocks: 19
Av. Block Size: 1908607
Tree Height: 8
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 3619112
Max Chunk Size: 3617792
Number of Blocks: 3
Av. Block Size: 1206370
Tree Height: 2
, 20.8399181 secs] [Times: user=0.70 sys=0.03, real=20.86 secs]

[Times: user=0.70 sys=0.03, real=20.86 secs] (real time far above user time) seems to indicate swapping. But we have a Windows 2003 server with 8GB RAM (with PAE), with committed bytes in the 28 to 30% range.

Could you please suggest what could be going wrong here?

Regards,
Avinash

Alexey Ragozin (2013-09-03 19:43):

The bug is http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=2207828

But it looks like I was wrong; it is reported to be fixed in 6u27.
Morten Knudsen (2013-09-03 15:09):

Hi Alexey,

May I ask which bug you are referring to? I have looked at http://www.oracle.com/technetwork/java/javase/documentation/overview-156328.html, but haven't found anything.

Kind regards,
Morten
Alexey Ragozin (2013-08-31):

Hi,

1. I would recommend upgrading the JVM version; if I remember correctly, there was a bug in the parallel processing of references in CMS prior to 6u34.

2. Your case looks like a fragmentation issue. So far I know only two remedies for this problem: restart the JVM regularly, or increase the heap size (which increases the mean time to promotion failure).

bearbear (2013-08-28):

Alexey, I am having both errors in my application: concurrent mode failure and promotion failed.

Here is my log:

2013-08-27T16:59:53.952-0500: 201564.214: [GC 201564.214: [ParNew202182.528: [SoftReference, 0 refs, 0.0000110 secs]202182.528: [WeakReference, 1015 refs, 0.0001720 secs]202182.528: [FinalReference, 1625 refs, 0.0025210 secs]202182.531: [PhantomReference, 1 refs, 0.0000050 secs]202182.531: [JNI Weak Reference, 0.0000100 secs] (promotion failed): 3774912K->3774912K(3774912K), 619.3859950 secs]202183.600: [CMS CMS: abort preclean due to time 2013-08-27T17:10:14.937-0500: 202185.199: [CMS-concurrent-abortable-preclean: 7.271/626.671 secs] [Times: user=699.20 sys=98.95, real=626.55 secs]
 (concurrent mode failure)202189.179: [SoftReference, 614 refs, 0.0000710 secs]202189.179: [WeakReference, 5743 refs, 0.0007600 secs]202189.180: [FinalReference, 3380 refs, 0.0004430 secs]202189.180: [PhantomReference, 212 refs, 0.0000260 secs]202189.180: [JNI Weak Reference, 0.0000180 secs]: 8328471K->4813636K(8388608K), 18.5940310 secs] 11323577K->4813636K(12163520K), [CMS Perm : 208169K->208091K(346920K)], 637.9802050 secs] [Times: user=705.47 sys=98.79, real=637.85 secs]
Total time for which application threads were stopped: 637.9820120 seconds

Could you advise on JVM parameter tuning?

Here is my server information:

-Xmx12288m
-Xms12288m
-XX:MaxPermSize=1024m
-XX:MaxNewSize=4096m
-XX:InitialCodeCacheSize=128m
-XX:ReservedCodeCacheSize=256m
-XX:+UseCodeCacheFlushing
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSClassUnloadingEnabled
-XX:+UseCodeCacheFlushing
-XX:+OptimizeStringConcat
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:TLABSize=512k
-XX:+PrintGC
-XX:+DisableExplicitGC
-XX:+PrintGCDetails
-XX:-PrintConcurrentLocks
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintReferenceGC
-XX:+PrintJNIGCStalls
-XX:+PrintGCApplicationStoppedTime
-XX:+CMSScavengeBeforeRemark
-XX:ConcGCThreads=12
-XX:ParallelGCThreads=12

I have 24 CPUs (Intel(R) Xeon(R) CPU X5650 @ 2.67GHz), RAM: 24675112 kB. JDK version: java version "1.6.0_32"

Alexey Ragozin (2012-09-12 15:23):

You seem to be allocating pretty large objects (your log references 4MiB and 8MiB block allocation failures). Free lists do not cover such large objects.

I have two guesses:
1. You have one or several very big HashMaps (a couple of million keys), and the failure is caused by a hash table resize.
2. You are allocating large arrays frequently.

In the first case, you should reserve enough capacity in the HashMap on creation to avoid further resizes.
In the second case, only increasing the heap size is likely to help.

Either way, analyzing a heap dump should help you understand your memory occupation.
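The HashMap remedy above can be sketched as follows. This is an illustration, not code from the thread: a long-lived map created at default capacity repeatedly doubles its internal table, and each doubling allocates a progressively larger contiguous array, exactly the kind of large block a fragmented CMS old generation may fail to satisfy. Pre-sizing avoids all resizes:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Initial capacity such that `expected` entries fit without a resize
    // at HashMap's default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    public static void main(String[] args) {
        int expected = 100_000;
        // Default-capacity map: the table doubles 16 -> 32 -> ... as it
        // grows, allocating a larger contiguous array each time.
        Map<Integer, Integer> grown = new HashMap<>();
        // Pre-sized map: one table allocation, no resizes ever.
        Map<Integer, Integer> presized = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            grown.put(i, i);
            presized.put(i, i);
        }
        System.out.println(capacityFor(expected)); // 133334
    }
}
```

The same idea applies to ArrayList and StringBuilder: any growable structure that is known to become large is better created at its final size.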
Vasanth (2012-09-12 04:51):

Hi Alexey,
I have gone through your blogs and they are an excellent source of rich information on GC patterns in Java. We are running into some serious issues with GC in our production environment, where we are seeing numerous back-to-back promotion failures. We are using JDK 1.6.0_24, and below are the JVM settings. I am desperate for any help you can provide. Thanks in advance!

- Vasanth

-XX:PermSize=256m -XX:MaxPermSize=256m -XX:NewSize=512m -XX:MaxNewSize=512m -XX:SurvivorRatio=8 -XX:+DisableExplicitGC -Xms2560m -Xmx2560m -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:ParallelGCThreads=12 -Dweblogic.SocketReaders=12 -XX:-UseBiasedLocking -XX:LargePageSizeInBytes=128m -XX:-UseThreadPriorities -Dnetworkaddress.cache.ttl=180 -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1

Excerpts from GC logs showing promotion failures:

47966.117: [ParNew (9: promotion failure size = 4997124) (promotion failed): 455476K->455653K(471872K), 0.0703983 secs]47966.188: [CMSCMS: Large block 0x8af1b6f0
: 1221978K->515117K(2097152K), 15.4494470 secs] 1667493K->515117K(2569024K), [CMS Perm : 177716K->174905K(262144K)]After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 404984388
Max Chunk Size: 404984388
Number of Blocks: 1
Av. Block Size: 404984388
Tree Height: 1
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
, 15.5235571 secs] [Times: user=17.57 sys=0.06, real=15.52 secs]

48225.553: [ParNew (10: promotion failure size = 8083460) (promotion failed): 457342K->457341K(471872K), 0.0770736 secs]48225.630: [CMSCMS: Large block 0xd89239b8
48226.011: [CMS-concurrent-sweep: 2.528/2.825 secs] [Times: user=7.61 sys=0.10, real=2.83 secs]
 (concurrent mode failure)CMS: Large block 0x8afc35d8
: 1701999K->515789K(2097152K), 14.7651760 secs] 2143373K->515789K(2569024K), [CMS Perm : 175107K->175061K(262144K)]After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 404812426
Max Chunk Size: 404812426
Number of Blocks: 1
Av. Block Size: 404812426
Tree Height: 1
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
, 14.8443607 secs] [Times: user=15.40 sys=0.04, real=14.85 secs]
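When a log contains many excerpts like the ones above, it can help to pull the failing allocation sizes out programmatically to see whether one large structure dominates. This is a hypothetical helper, not part of any tool mentioned in this thread; the regex simply matches the "promotion failure size = N" wording that -XX:+PrintPromotionFailure produces:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PromotionFailures {
    static final Pattern SIZE = Pattern.compile("promotion failure size = (\\d+)");

    // Extract every reported promotion-failure size from a GC log fragment.
    static List<Long> failureSizes(String gcLog) {
        List<Long> sizes = new ArrayList<>();
        Matcher m = SIZE.matcher(gcLog);
        while (m.find()) {
            sizes.add(Long.parseLong(m.group(1)));
        }
        return sizes;
    }

    public static void main(String[] args) {
        String log = "47966.117: [ParNew (9: promotion failure size = 4997124)"
                   + " ... 48225.553: [ParNew (10: promotion failure size = 8083460)";
        // Prints the sizes as reported by -XX:+PrintPromotionFailure
        System.out.println(failureSizes(log)); // [4997124, 8083460]
    }
}
```

If the extracted sizes cluster around a few repeated values (as in Vasanth's log), the failures likely come from one allocation site, such as a hash table resize or a recurring large array.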
Nicolas AScanio Peña (2011-12-06):

Hello Alexey -- thanks for the blogs. I read this article further; I think it's great.
Alexey Ragozin (2011-10-20):

Ramki,

Thank you for the clarification. I will update the post, but I want to spend some time to better understand CMS allocation buffers first.

In the meantime, I have published results from my tests: http://aragozin.blogspot.com/2011/10/cms-heap-fragmentation-follow-up-1.html

ikmar (ramki) (2011-10-19 20:41):

Hi Alexey -- Thanks for confirming that the fragmentation is worse with the use of "PLABs" in the old generation, even after the fix for 6999988.

I'd like to point out, though, one small inaccuracy in your description of the use of "PLABs" for the CMS old space when allocating small blocks. As you know, the small blocks come from per-size free lists indexed by size. This pool is populated, when empty, by allocating a large block that is a multiple of the small block size we are trying to allocate from, and carving it into a number of smaller blocks. However, this populating of the free lists is orthogonal to the PLAB allocation, in the sense that some subset of these blocks is then removed from the global pool and added to the per-thread cache. This cache of free blocks of various sizes constitutes the PLAB from which a specific thread will allocate. The number of blocks fetched into this PLAB is dynamically adjusted based on perceived recent demand, suitably averaged over the set of GC worker threads, and low-pass filtered to avoid large fluctuations. Thus, in steady-state conditions, the number fetched into the local cache should closely track the number allocated in each scavenge. By setting OldPLABSize to 16 and turning off dynamic resizing of this parameter, one could end up with excess inventory lying around in the caches and the global free pool.

One final note: OldPLABSize is interpreted in CMS not as the size of the PLAB in words, but rather as the number of free blocks of a given size used when replenishing the local per-worker free list caches.

I agree that any changes to these parameters should be made with the utmost care, for the reasons you give. We have a number of customers who saw improved scavenge times as well as improved fragmentation behaviour as a result of the changes in 6631166 and 6999988.

Thanks again for your well thought out and well-written blogs!

-- ramki
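The fragmentation mechanism debated throughout this thread can be reproduced with a toy allocation pattern. This is a hypothetical stress sketch, not one of Alexey's test programs: keep a pool of mixed-size blocks alive long enough to tenure, then replace random entries so the old generation accumulates free holes of varying sizes. Run it with CMS flags such as -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:PrintFLSStatistics=1 and a small heap to watch the free-list statistics degrade:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FragmentationStress {
    static final Random RND = new Random(42);

    // Blocks from 1 KiB to ~65 KiB: tenured blocks of different sizes
    // interleaved in memory are what leave odd-shaped holes behind.
    static byte[] mixedBlock() {
        return new byte[1024 + RND.nextInt(64 * 1024)];
    }

    public static void main(String[] args) {
        List<byte[]> pool = new ArrayList<>();
        for (int i = 0; i < 2_000; i++) {
            pool.add(mixedBlock());
        }
        // Churn: each replacement frees a block of one size and allocates
        // a block of another. CMS never compacts the old generation, so
        // its free lists gradually fill with fragments until a large
        // request cannot be satisfied despite ample total free space.
        for (int round = 0; round < 50; round++) {
            for (int i = 0; i < 1_000; i++) {
                pool.set(RND.nextInt(pool.size()), mixedBlock());
            }
        }
        System.out.println(pool.size());
    }
}
```

With a compacting collector (ParallelOld, G1) the same pattern never degrades, which is why the thread keeps returning to "restart regularly or grow the heap" as the only CMS-side remedies.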
Alexey Ragozin (2011-10-19 01:12):

Hi Ramki,

Very interesting note.
I was using JDK 6u26 in my tests. I have repeated my test with 6u24; indeed, it showed much more frequent promotion failure events compared to 6u26.

Regards,
Alexey

ikmar (ramki) (2011-10-18):

Hi Alexey -- Thanks for the blogs. Which JVM version were you using in your experiments? Are you aware of Sun CR 6999988? As far as I can tell, this is fixed for the first time in 6u25 / hs20. The existence of that bug meant that all of the "sophisticated machinery" introduced in 6631166 was ineffective.

Thanks.
-- ramki