Wednesday, June 1, 2011

Understanding GC pauses in JVM, HotSpot's minor GC.

Stop the world pauses of JVM due to work of garbage collector are known foes of java based application. HotSpot JVM has a set of very advanced and tunable garbage collectors, but to find optimal configuration it is very important to understand an exact mechanics garbage collection algorithms. This article one of series explaining how exactly GC spends our precious CPU cycle during stop-the-world pauses. An algorithm for young space garbage collection in HotSpot is explained in this issue.

Structure of heap

Most of modern GCs are generational. That means java heap memory is separated into few spaces. Spaces are usually distinguished by “age” of objects. Objects are allocated in young space, then, if they survive long enough, eventually promoted to old (tenured) space. That approach rely on hypothesis that most object “die young”, i.e. majority of objects are becoming garbage shortly after being allocated. All HotSpot garbage collectors are separating memory into 5 spaces (though for G1 collector spaces may not be continuous).

  • Eden are space there objects are allocated,
  • Survivor spaces are used to receive object during young (or minor GC),
  • Tenured space is for long lived objects,
  • Permanent space is for JVM own objects (like classes and JITed code), it is behaves very like tenured space so we will ignore it for rest of article.
Eden and 2 survivor spaces together are called young space.

HotSpot GC algorithms overview

HotSpot JVM is implementing few algorithms for GC which are combined in few possible GC profiles.
  • Serial generational collector (-XX:+UseSerialGC).
  • Parallel for young space, serial for old space generational collector (-XX:+UseParallelGC).
  • Parallel young and old space generational collector (-XX:+UseParallelOldGC).
  • Concurrent mark sweep with serial young space collector (-XX:+UseConcMarkSweepGC
    –XX:-UseParNewGC
    ).
  • Concurrent mark sweep with parallel young space collector (-XX:+UseConcMarkSweepGC –XX:+UseParNewGC).
  • G1 garbage collector (-XX:+UseG1GC).
All profiles except G1 are using almost same young space collection algorithms (with serial vs parallel variations).

Write barrier

Key point of generational GC is what it does need to collect entire heap each time, but just portion of it (e.g. young space). But to achieve this JVM have to implement special machinery called “write barrier”. There 2 types of write barriers implemented in HotSpot: dirty cards and snapshot-at-the-beginning (SATB). SATB write barrier is used in G1 algorithms (which is not covered in this article). All other algorithms are using dirty cards.

Dirty cards write-barrier (card marking)

Principle of dirty card write-barrier is very simple. Each time when program modifies reference in memory, it should mark modified memory page as dirty. There is a special card table in JVM and each 512 byte page of memory has associated byte in card table.

Young space collection algorithm

Almost all new objects (there are few exception when new object can be allocated directly in old space) are allocated in Eden space. To be more effective HotSpot is using thread local allocation blocks (TLAB) for allocation of new objects, but TLAB themselves are allocated in Eden. Once Eden becomes full minor GC is triggered. Goal of minor GC is to clear fresh garbage in Eden space. Copy-collection algorithm is used (live objects are copied to another space, and then whole space is marked as free memory). But before start collecting live objects, JVM should find all root references. Root references for minor GC are references from stack and all references from old space.
Normally collection of all reference from old space will require scanning through all objects in old space. That is why we need write-barrier. All objects in young space have been created (or relocated) since last reset of write-barrier, so non-dirty pages cannot have references into young space. This means we can scan only object in dirty pages.

Once initial reference set is collected, dirty cards are reset and JVM starts coping of live objects from Eden and one of survivor spaces into other survivor space. JVM only need to spend time on live objects. Relocating of object also requires updating of references pointing to it.
While JVM is updating references to relocated object, memory pages get marked again, so we can be sure what on next young GC only dirty pages has references to young space.
Finally we have Eden and one survivor space clean (and ready for allocation) and one survivor space filled with objects.

Object promotion

If object is not cleared during young GC it will be eventually copied (promoted) to old space. Promotion occurs in following situations:
  • -XX:+AlwaysTenure  makes JVM to promote objects directly to old space instead of survivor  space (survivor spaces are not used in this case).
  • once survivor space is full, all remaining live object are relocated directly to old space.
  • If object has survived certain number of young space collections, it will be promoted to old space (required number of collections can be adjusted using –XX:MaxTenuringThreshold option and –XX:TargetSurvivorRatio JVM options).

Allocation of new objects in old space

It would be beneficial if we could possibly allocate long lived objects directly in old space. Unfortunately there is no way to instruct JVM to do this for particular object. But there are few cases when object can be allocated directly in old space.
  • Option -XX:PretenureSizeThreshold=<n> instructs JVM what all objects larger  than <n> bytes should be allocated directly in old space (though if object size fits TLAB, JVM will allocate it in TLAB and thus young space, so you should also limit TLAB size).
  • If object is larger than size of Eden space it also will be allocated in old space.
Unlike application objects, system objects are always allocated by JVM directly in permanent space.

Parallel execution

Most of task during young space collection can be done in parallel. If there are several CPUs available, JVM can utilize them to compress duration of stop-the-world pause during collection. Number of threads can be configured in HotSpot JVM by –XX:ParallelGCTreads= parameter. By default JVM will choose number of thread by number of available CPU. As expected, serial version of collector will ignore this parameter because it can use only one CPU. Using parallel collection reduces time of stop-the-world pause by factor close to number of physical cores.

Time of young GC

Young space collection happens during stop-the-world pause (all non-GC-related threads in JVM are suspended). Wall clock time of stop-the-world pause is very important factor for applications (especially applications requiring fast response time). Parallel execution affects wall clock time of pause but not work effort to be done.
Let’s summarize components of young GC pause. Total pause time can be written as:
Tyoung = Tstack_scan + Tcard_scan + Told_scan+ Tcopy ; there Tyoung is total time of young GC pause, Tstack_scan is time to scan root in stacks, Tcard_scan is time to scan card table to find dirty pages, Told_scan  is time to scan roots in old space and Tcopy is time to copy live objects (1).
Thread stack are usually very small, so major factors affecting time of young GC is Tcard_scan , Told_scan and Tcopy.
Another important parameter is frequency of young GC. Period between young collections is mainly determined by application allocation rate (bytes per second) and size of Eden space.
Pyoung = Seden / Ralloc ; there Pyoung  is period between young GC, Seden is size of Eden and Ralloc is rate of memory allocations (bytes per second) (2).
Tstack_scan – can be considered application specific constant.
Tcard_scan – is proportional to size of old space. Literally, JVM have to check single byte of table for each 512 bytes of heap (e.g. 8G of heap -> 16m to scan).
Told_scan  – is proportional to number of dirty cards in old space at the moment of young GC. If we assume that references to young space are distributed randomly in old space, then we can provide following formula for time of old space scanning.
; there Sold is size of old space and D, kcard and ncard are coefficients specific for application (3).


Tcopy – is proportional to number of live objects in heap. We can approximate it by formula:






There kcopy is effort to copy object, Rlong_live is rate of allocation of long lived objects,  


(ksurvive usually very small), and ktenure is a coefficient to approximate aging of object in young space before tenuring (ktenure ≥ 1) (4).
Now we can analyze how various JVM options may affect time and frequency of young GC.

Size of old space.
Size of old space is affecting Tcard_scan and Told_scan part of young GC pause time according to formulas above. So we as we are increasing size of old space (read total heap size) time of young GC pauses will grow and it cannot be helped. After certain size of heap (usually 4-8 Gb) time of young collection is dominated by Tcard_scan (technically Tcopy can be even greater than Tcard_scan , but it usually can be controlled by tuning of other GC options).
HotSpot JVM options:
-Xmx=<n> -Xms=<n>

Size of Eden space 
Period between young GC is proportional to size of Eden. Tcopy is also proportional to size of eden but in practice ksurvive can be so small that for some applications we can forget about Tcopy. Unfortunately time between young GC will also affect coefficient D in equation (4). Though dependency between D and Pyoung is very application specific, increasing Pyoung will increase D and as a consequence Tscan_old.
HotSpot JVM options:
-XX:NewSize=<n> -XX:MaxNewSize=<n>

Size of survivor space. 
Size of survivor space puts hard limit of how much objects can stay in young space between collections. Changing size of survivor space may affect ktenure (or mat not, e.g. if ktenure is already 1).
HotSpot JVM options:
-XX:SurviorRatio=<n>

Max tenuring threshold and target survivor ratio 
These two JVM options also allow artificially adjust ktenure.
HotSpot JVM options:
-XX:TargetSurviorRatio=<n>
‑XX:MaxTenuringThreshold=<n>
-XX:+AlwaysTenure
-XX:+NeverTenure

Pretenuring threshold 
For certain applications using pretenuring threshold could reduce ksurvive due to allocation of long lived object directly in old space.
HotSpot JVM options:
-XX:PretenureThreshold=<n>

See also

Other articles about garbage collection in this blog

35 comments:

  1. Your blog is the only article I found that explained how the time was spent in garbage collection, very informational.

    I'm trying to make sense of a performance issue that I'm facing. Our web app only normally uses around 1G memory, and the young GC takes less 1 second. However, every now and then there is some periods that young GC continuously takes several minutes to work with the 380M young generation, regardless how much memory is used in the old generation. We have 12G heap in total, and use ParNew and CMS for garbage collection. By default the app starts with only about 19M young generation, and only expands it to 380M after a promotion failure. I tried different garbage collection mode and parameters, including throughput collector, but the app always has trouble in GC for some periods. There is little clue I found from the heap dump. What may slow down GC so dramatically? Thanks.

    ReplyDelete
    Replies
    1. Hi,

      Young collection taking a minute is definitely abnormal.

      One common reason for abnormally long GC pauses is swapping. Though one minute is too long to be explained by swapping.

      Other reason my be JNI, try to enable -XX:+PrintJNIGCStalls to get diagnostics on that. JNI allows native code to pin large java object in address space for limited time, GC cannot start until native code unpin all java objects.

      Regards

      Delete
    2. Thank you for the quick response. It took some time to reproduce this with the new GC parameters, but I didn't see any stall in the log. However, enabling reference printing got more information in the log:
      85405.788: [GC 85405.788: [ParNew85405.879: [SoftReference, 0 refs, 0.0000440 secs]85405.879: [WeakReference, 1007 refs, 0.0002630 secs]85405.879: [FinalReference, 42 refs, 0.0002240 secs]85405.879: [PhantomReference, 0 refs, 0.0000220 secs]85405.879: [JNI Weak Reference, 145.1870810 secs]: 375695K->42560K(383424K), 145.2780270 secs] 7652920K->7319797K(11463104K), 145.2783020 secs] [Times: user=142.14 sys=0.05, real=145.25 secs]

      Does [JNI Weak Reference, 145.1870810 secs] mean so much spent in collecting JNI weak references?

      Delete
    3. That is it. JVM were processing JNI weak references for 145 sec. Number looks totally unreasonable though.
      I do not know much about JNI weak references.
      hotspot-gc-use-request@openjdk.java.net mailing list should be a right place to continue investigating.

      Delete
    4. You just found the reason not to use weak references. Get it out of your code.

      Delete
  2. Very informative. Thank You!!

    Manish

    ReplyDelete
  3. One of the best articles I have come across..

    ReplyDelete
  4. In "HotSpot GC algorithms overview", you say:
    Concurrent mark sweep with serial young space collector (-XX:+UseConcMarkSweepGC –XX:-UseParNewGC).

    Concurrent mark sweep with parallel young space collector (-XX:+UseConcMarkSweepGC –XX:+UseParNewGC).

    Which one is the correct? And what parameters do we need for the other?

    ReplyDelete
    Replies
    1. Take a close look, it is a different lines (hint "-XX:+" - enables "-XX:-" disables option).

      Delete
  5. You said:

    If object has survived certain number of young space collections, it will be promoted to old space.

    However, according to my observation, objects won't be promoted until it hits the Desired survivor size limit.

    When I checked OpenJDK 7, I saw this:
    http://hg.openjdk.java.net/jdk7/hotspot/hotspot/raw-file/92da084fefc9/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp

    // Try allocating obj in to-space (unless too old)
    if (dummyOld.age() < tenuring_threshold()) {
    new_obj = (oop)par_scan_state->alloc_in_to_space(sz);
    if (new_obj == NULL) {
    set_survivor_overflow(true);
    }
    }

    In my test, tenuring_threshold is almost always 16, while the max age is 15, and since it had plenty of room in survivor space, allocation for new_obj succeeded most of the time. So I believe this is the place the GC will go:

    } else {
    // Is in to-space; do copying ourselves.
    Copy::aligned_disjoint_words((HeapWord*)old, (HeapWord*)new_obj, sz);
    forward_ptr = old->forward_to_atomic(new_obj);
    // Restore the mark word copied above.
    new_obj->set_mark(m);
    // Increment age if obj still in new generation
    new_obj->incr_age();
    par_scan_state->age_table()->add(new_obj, sz);
    }

    For new_obj->set_mark(m), I guess this is the final code for execution:

    http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/9b0ca45cd756/src/share/vm/oops/markOop.hpp

    markOop incr_age() const { return age() == max_age ? markOop(this) : set_age(age() + 1); }

    So the age will stop growing when it reaches max_age.

    Without enough knowledge, I don't know how max_age is set. But I would like to guess that max_age was 15 in my test, so that all those objects will remain in survivor space for a very long time, say 3 hours or more.

    However, in my application, I really want those object promote to Old Gen. Keeping them in the Survivor space hurt the performance, because every minor GC needs to copy all those very old objects again and again, not sure about the scanning.

    I also read "OpenJDK patch cutting down GC pause duration up to 8 times". I am wondering whether this patch will help improving the performance for case like mine. I am using Oracle Java 7 up 09, not sure whether it contains you patch or not.

    Thank you very much for all of your effort.

    ReplyDelete
    Replies
    1. Hi,

      1. Max age cannot exceed 15 because it is a 4 bit field (that may not be true for 64 bit JVM, not sure about it).
      2. Provided max age is 15 - TenuringThreshold = 16 means keep object in young space until overflow.
      3. JVM adjusts TenuringThreshold to prevent survival space overflow and usually it is far less than 16

      Regards,
      Alexey

      Delete
    2. Thank you, Alexey.

      We are using 64 bit JVM. From GC log, I can see this:

      Desired survivor size 536870912 bytes, new threshold 16 (max 31)

      So I assume Max age is 31 instead of 15 in 64 bit JVM.

      My problem with Minor GC is that it won't overflow, so that some very old objects are kept being scanned again and again, for up to hours.

      Thanks,
      Jeff

      Delete
    3. In this case you should either shrink survivor space (-XX:SurvivorRatio=N) or use -XX:MaxTenuringThreshold=N to set max age for object in young space.

      Delete
  6. Great article. Thanks for putting this together!

    ReplyDelete
  7. Hi Alexey,

    Great article. the only best article i found on net. just want to know what is the default collectors for each generation in java 7?

    Thanks

    ReplyDelete
    Replies
    1. Default GC algo depends on -client / -server flags and some hardware heuristics.

      Most likely it would parallel scavenge + parallel old combination.

      But instead of guessing you can easily check in JConsole.

      Delete
  8. Hi Alexey,

    thanks for great explanation. Could you clarify a few points for me please?
    1) After young gc is complete, are all pages in the card table "clean"? Or are some pages still dirty, because they still point to objects from young generation?
    2) When objects are relocated or promoted, how does JVM know, which references point to a particular object. Does an object maintain a list of objects it is referenced from?

    ReplyDelete
    Replies
    1. I'll answer second question first

      2) If object is relocated/promoted its new address is stored at old location (at object header). During GC EVERY reference pointing to young space is processed and rewritten with new object location.

      1) There are two activities related to card table happen during young GC. Reference scanning tasks are clearing cards as memory is being scanned. Though if reference to object is rewritten (due to its relocation), card is marked dirty again. Every live object form young space is being relocated, so right after young GC, every memory page containing references to survived young objects is marked as dirty.

      Delete
  9. Hi Alexey,
    Thanks for the article. I have one confusion here.

    After reading this
    "Normally collection of all reference from old space will require scanning through all objects in old space. That is why we need write-barrier. All objects in young space have been created (or relocated) since last reset of write-barrier, so non-dirty pages cannot have references into young space. This means we can scan only object in dirty pages."

    I am unable to understand why we have Tcard_scan as well as Told_scan while calculating time for young collection.

    Do you mean first look to card table and find a old object and then do old scan to find references of that old object to young gen object ?

    ReplyDelete
    Replies
    1. GC threads are scanning card table. If dirty flag is found in card table, corresponding page of old space should be scanned to find actual references to young objects.

      Delete
  10. Very good article.

    We can use some tools to measure the young GC time and full GC time. Can I ask how can we decompose the young GC time into details? Using which parameter or what tools or anything else? Thanks.


    ReplyDelete
  11. I reality card scanning and object relocation are running in parallel (once thread encounters next dirty card it immediately scans page and process all young reference recursively), so you cannot time them separately.

    ReplyDelete
  12. How are young to old cross generation references handled ? Does the old gen gc follow the same card marking technique ?

    ReplyDelete
    Replies
    1. All references from young to old are considered GC roots for old collection.

      Delete
  13. Hi Alexey,
    Thanks for the article. Could you tell me when minor gc will cause major gc ?

    ReplyDelete
    Replies
    1. It depends on GC type. In MSC full collection is triggered if free space in old gen below projected promotion volume. In CMS, there is occupancy threshold, by promoting objects young GC increasing old space occupancy and my trigger old collection cycle. In G1, there are no old space collections, mixed collection collect young space and portion of old space (full GC may happen as resort though).

      Delete
  14. As it shows above(Writer Barrier), 512byte is the threshold value of memory block. If i have an object (2MB), what can JVM do about it?

    ReplyDelete
    Replies
    1. If it is regual Java object page of first object's byte is marked. If it is Object[] page containing element being written is marked.

      Delete
    2. Thank you, Alexey!

      Before i asked the question last night, i have misunderstood this topic. "512 bytes" should be the threshold value of memory page which is aimed to record the memory size of Regular Java object. So, i think 512 bytes is enough for recording.

      Delete
    3. There are no pages. Each page is just DIRTY/CLEAN flag value in card table. There DIRTY means - corresponding 512 bytes of heap may contain object reference modified since last card table reset.

      Delete
  15. Hi. I have a question. When object gets to old from young it is marked as Dirty card (right?). Let's suppose that:
    1. Object A is live and is reffering to Object B (in young collection)
    2. After GC Object A is promoted to old (so is marked as dirty card).
    3. Object A is not modified, but because it was marked as dirty card, Object B is still live, because of Object A.
    4. Another GC -> Cards are reset.
    5. Object A was not modified, so is not marked as dirty card.
    6. Object B is deleted?

    Are old object marked somehow if they were found to have a reference to young object?

    Or maybe when objects are copied to one of survivor space, the references to them were changed, therefore objects in old generation were modified and marked as dirty cards?

    ReplyDelete
    Replies
    1. At step (4) all objects in young space are physically moved to surviviour space, so object A would be updated with new address of B and thus marked as dirty.

      Delete
  16. Great article! I have a question. If card table is just used to mark a 512Bytes page is dirty or clean, a bit is enough, why a byte is needed?

    ReplyDelete
    Replies
    1. Good questions. I do not know exact answer. Some implementations are using bitmap for that purpose.
      Though, I think bit manipulation would require more instructions for write barrier. In addition, I believe extra values of byte may be used during CMS concurrent preclean phase.

      Delete