The term "Stop-the-World pause" is usually associated with garbage collection. Indeed, GC is a major contributor to STW pauses, but it is not the only one.
Safepoints
In the HotSpot JVM, the Stop-the-World pause mechanism is called a safepoint. During a safepoint, all threads running Java code are suspended. Threads running native code may continue to run as long as they do not interact with the JVM: an attempt to access Java objects via JNI, to call a Java method, or to return from native code to Java will suspend the thread until the end of the safepoint.
Stopping all threads is required to ensure that the safepoint initiator has exclusive access to JVM data structures and can do crazy things like moving objects in the heap or replacing the code of a method that is currently running (On-Stack-Replacement).
How do safepoints work?
The safepoint protocol in the HotSpot JVM is collaborative. Each application thread checks the safepoint status and parks itself in a safe state if a safepoint is required.
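The collaborative idea can be illustrated with a toy model in plain Java. This is only an analogy and not how HotSpot implements safepoints: the class ToySafepoint and its methods poll(), runAtSafepoint() and demo() are hypothetical names, and a volatile flag plus a monitor stands in for the JVM's own check. Worker threads call poll() on every loop iteration and park themselves when a safepoint is requested; the initiator waits until all of them are parked before doing its work.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of a cooperative safepoint. An analogy only, not HotSpot's mechanism. */
class ToySafepoint {
    private final int workers;                  // threads that must park
    private volatile boolean requested = false; // read on the workers' fast path
    private int parked = 0;                     // guarded by 'this'

    ToySafepoint(int workers) { this.workers = workers; }

    /** Safepoint check; workers call it at "safe" places such as a loop back-edge. */
    void poll() throws InterruptedException {
        if (!requested) return;                 // fast path: nothing pending
        synchronized (this) {
            if (!requested) return;
            parked++;
            try {
                notifyAll();                    // let the initiator re-check the count
                while (requested) wait();       // parked until the safepoint ends
            } finally {
                parked--;
            }
        }
    }

    /** Runs the action once every worker is parked. Assumes one initiator at a time. */
    synchronized void runAtSafepoint(Runnable action) throws InterruptedException {
        requested = true;
        while (parked < workers) wait();        // wait for all workers to arrive
        try {
            action.run();                       // exclusive access: workers are parked
        } finally {
            requested = false;
            notifyAll();                        // resume the workers
        }
    }

    /** Demo: returns true if no worker made progress while the safepoint action ran. */
    static boolean demo() {
        final int n = 3;
        final ToySafepoint sp = new ToySafepoint(n);
        final long[] counters = new long[n];    // one slot per worker
        final AtomicBoolean stop = new AtomicBoolean(false);
        final boolean[] frozen = new boolean[1];
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    while (!stop.get()) {
                        counters[id]++;         // "application work"
                        sp.poll();              // safepoint check on every iteration
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        try {
            sp.runAtSafepoint(() -> {
                long[] before = counters.clone();
                try { Thread.sleep(50); } catch (InterruptedException ignored) { }
                frozen[0] = Arrays.equals(before, counters);
            });
            stop.set(true);
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return frozen[0];
    }

    public static void main(String[] args) {
        System.out.println("workers frozen during safepoint: " + demo());
    }
}
```

Note how the initiator's pause includes the time it takes the slowest worker to reach its next poll() call; the same is true of a real safepoint, which cannot begin until every thread has reached a safe state.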
For compiled code, the JIT inserts safepoint checks at certain points (usually after a return from a call or at the back jump of a loop). For interpreted code, the JVM has two bytecode dispatch tables; if a safepoint is required, the JVM switches tables to enable the safepoint check.
The safepoint status check itself is implemented in a very cunning way. A normal check of a memory variable would require expensive memory barriers. Instead, the safepoint check is implemented as a plain memory read from a dedicated address. When a safepoint is required, the JVM unmaps the page containing that address, provoking a page fault on the application thread (which is handled by the JVM's handler). This way, HotSpot keeps its JITed code CPU-pipeline friendly, yet ensures correct memory semantics (the page unmap forces a memory barrier on the processing cores).
When are safepoints used?
Below are a few reasons for the HotSpot JVM to initiate a safepoint:
- Garbage collection pauses
- Code deoptimization
- Flushing code cache
- Class redefinition (e.g. hot swap or instrumentation)
- Biased lock revocation
- Various debug operations (e.g. deadlock check or stack trace dump)
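The last item is easy to provoke from inside a process. A minimal sketch, using the standard java.lang.management.ThreadMXBean API, requests a deadlock check and a stack trace dump of all threads. The class name DebugSafepointOps and its method names are hypothetical, and whether a particular call actually reaches a safepoint depends on the JVM version, so treat it as a way to trigger such operations while watching the safepoint diagnostics rather than as a guarantee.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Requests two debug operations: a deadlock check and a stack trace dump. */
class DebugSafepointOps {

    /** Deadlock check: returns the number of threads deadlocked on monitors. */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findMonitorDeadlockedThreads(); // null when none found
        return ids == null ? 0 : ids.length;
    }

    /** Stack trace dump of all live threads, without lock ownership details. */
    static ThreadInfo[] dumpStacks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.dumpAllThreads(false, false);
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
        for (ThreadInfo info : dumpStacks()) {
            System.out.println(info.getThreadName() + " " + info.getThreadState());
        }
    }
}
```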
Troubleshooting safepoints
Normally safepoints just work, so you rarely need to care about them (most of them, except the GC ones, are extremely quick). But if something can break, it will break eventually, so here are some useful diagnostics:
- -XX:+PrintGCApplicationStoppedTime – this will actually report the pause time for all safepoints (GC related or not). Unfortunately, the output of this option lacks timestamps, but it is still useful for narrowing a problem down to safepoints.
- -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 – these two options force the JVM to report the reason and timings after each safepoint (reported to stdout, not to the GC log).
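The output of -XX:+PrintGCApplicationStoppedTime consists of lines of the form "Total time for which application threads were stopped: 0.0020916 seconds", so unusually long pauses are easy to pick out with a few lines of code. The following is a minimal sketch; the class StoppedTimeScan and the method pausesLongerThan are hypothetical names, and the pattern assumes the line format shown here, which may carry extra fields on other JVM versions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Picks long pauses out of -XX:+PrintGCApplicationStoppedTime output. */
class StoppedTimeScan {
    // Assumed line format; anything after "seconds" is ignored.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9]+\\.[0-9]+) seconds");

    /** Returns, in log order, every reported pause longer than the threshold. */
    static List<Double> pausesLongerThan(Iterable<String> lines, double thresholdSeconds) {
        List<Double> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double seconds = Double.parseDouble(m.group(1));
                if (seconds > thresholdSeconds) {
                    result.add(seconds);
                }
            }
        }
        return result;
    }

    /** Reads a log from stdin and prints pauses longer than one second. */
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            for (String line; (line = in.readLine()) != null; ) {
                lines.add(line);
            }
        }
        for (double pause : pausesLongerThan(lines, 1.0)) {
            System.out.println("long pause: " + pause + " s");
        }
    }
}
```

Since the lines carry no timestamps, the position of a long pause in the log is the only hint about when it happened; the safepoint statistics output is what tells you why.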
References
· How does JVM handle locks – quick info about biased locking

Nice article Alexey. Do you know what part of the JDK code really does this - "JVM unmaps page with that address provoking page fault on application thread"?
I think the Azul JVM also used to do this to quickly trap moved/GC'ed addresses.
Using page faults for a read barrier ("quickly trap moved/GC'ed addresses") would be prohibitively expensive. The Azul JVM does not use page faults for its read barrier, though it does use this technique for defragmenting the physical memory associated with large objects.
Azul uses custom page mapping to facilitate a software read barrier, but this technique does not rely on page faults.
Or at least it was that way last time I was working with Azul.
Thank you for an eye-opening article on safepoints. Do you know if there is any way to identify the reason for a huge pause of hundreds of seconds that does not appear to be related to GC activity?
Total time for which application threads were stopped: 0.0020916 seconds
Total time for which application threads were stopped: 0.0677614 seconds
Total time for which application threads were stopped: 0.0016208 seconds
Total time for which application threads were stopped: 195.2580105 seconds
Total time for which application threads were stopped: 0.0313111 seconds
Total time for which application threads were stopped: 0.0005465 seconds
Total time for which application threads were stopped: 0.0006269 seconds
First, enable safepoint logging: -XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
This will allow you to understand whether a safepoint is the culprit.
The last problem I had with slow safepoints was a bug in the JIT combined with weird application code.
Trying the latest JVM is another step.
We switched to 1.6.0_43, at the time that happened we had 1.6.0_31. One of the reasons was bug 2221291. Can you tell me the bug ID for the problem related to JIT?
No, I didn't track the exact bug. A slight change to the code solved the issue in my case.
Yep, 2221291 is a nasty one.