Sunday, January 24, 2016

Flame Graphs Vs. Cold Numbers

Stack trace sampling is a very powerful technique for performance troubleshooting. The advantages of stack trace sampling are

  • it doesn't require upfront configuration
  • cost added by sampling is small and controllable
  • it is easy to compare analysis results from different experiments

Unfortunately, tools offered for stack trace analysis by widespread Java profilers are very limited.

Solving performance problems in complex applications (lots of business logic, etc.) is one of my regular challenges. Let's assume I have another misbehaving application on my hands. The first step is to localize the bottleneck to a specific part of the stack.

Meet the call tree

A call tree is built by digesting a large number of stack traces. Each node in the tree has a frequency: the number of traces passing through that node.

Usually tools allow you to navigate through call tree reconstructed from stack trace population.

There is also the flame graph visualization (shown at the top right of the page), which is fancier but represents just the same tree.
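The digesting step is simple to sketch: each trace is walked from root to tip, incrementing a counter at every node along the way. A minimal illustration (class and field names are mine, not SJK's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// One node of the call tree: a frame name, a hit counter,
// and children keyed by the next frame down the stack
class CallNode {
    final String frame;
    long count; // number of traces passing through this node
    final Map<String, CallNode> children = new LinkedHashMap<>();

    CallNode(String frame) {
        this.frame = frame;
    }

    // Digest one stack trace, ordered from root (outermost frame) to tip
    void add(String... stack) {
        CallNode node = this;
        node.count++;
        for (String f : stack) {
            node = node.children.computeIfAbsent(f, CallNode::new);
            node.count++;
        }
    }
}
```

Feeding every sampled trace through add() yields exactly the tree the profilers navigate; a flame graph is just a different rendering of the same counts.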

Looking at these visualizations, what can I see? Not much.

Why? Business logic somewhere in the middle of the call tree produces too many branches. The tree beneath the business logic is blurred beyond the point of usability.

Dissecting the call tree

The application is built using frameworks. For the sake of this article, I'm using an example based on JBoss, JSF, Seam, Hibernate.

Now, if 13% of traces in our dump contain JDBC, we can conclude that 13% of time is spent in JDBC / database calls.
13% is a reasonable number, so the database is not to blame here.

Let's go down the stack; Hibernate is the next layer. Now we need to count all traces containing Hibernate classes, excluding those that also contain JDBC. This way we can attribute traces to a particular framework and quickly get a picture of where time is spent at runtime.
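The exclusion logic amounts to a first-match classification over an ordered list of filters: a trace is attributed to the first category whose filter matches, so placing "JDBC" before "Hibernate" automatically excludes JDBC time from the Hibernate bucket. A rough sketch of the idea (class and method names are mine; SJK uses its own pattern language instead of predicates):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class TraceClassifier {
    // Ordered categories: a trace is counted under the FIRST matching one
    private final Map<String, Predicate<String[]>> categories = new LinkedHashMap<>();
    final Map<String, Long> counts = new LinkedHashMap<>();

    void category(String name, Predicate<String[]> filter) {
        categories.put(name, filter);
        counts.put(name, 0L);
    }

    void classify(String[] trace) {
        for (Map.Entry<String, Predicate<String[]>> e : categories.entrySet()) {
            if (e.getValue().test(trace)) {
                counts.merge(e.getKey(), 1L, Long::sum);
                return; // first match wins, later categories are excluded
            }
        }
        counts.merge("Other", 1L, Long::sum);
    }
}
```

With "JDBC" registered first, a trace containing both java.sql and org.hibernate frames lands in the JDBC bucket, which is exactly the exclusion described above.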

I didn't find any tool that could do this kind of analysis for me, so I built one for myself a few years ago. SJK is my universal Java troubleshooting toolkit.

Below is the command performing the analysis explained above.

sjk ssa -f tracedump.std --categorize -tf **.CoyoteAdapter.service -nc \
  "Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile" \
  "Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"

Below is the output of this command.

Total samples    2732050 100.00%
JDBC              405439  14.84%
Hibernate         802932  29.39%
Facelets compile  395784  14.49%
Seam bijection    385491  14.11%
JSF.execute       290355  10.63%
JSF.render        297868  10.90%
Other             154181   5.64%

Well, we clearly see a large amount of time spent in Hibernate. This is very wrong, so it is the first candidate for investigation. We also see that a lot of CPU is spent on JSF compilation, though pages should be compiled just once and cached (it turned out to be a configuration issue). Actual application logic falls into the JSF life cycle calls (execute(), render()). It would be possible to introduce an additional category to isolate pure application logic execution time, but looking at the numbers, I would say it is not necessary until the other problems are solved.

Hibernate is our primary suspect, but how do we look inside it? Let's look at the method histogram for traces attributed to Hibernate, trimming away all frames up to the first Hibernate method call.

Below is the command to do this.

sjk ssa -f tracedump.std --histo -tf **!**.jdbc -tt org.hibernate

Here is the top of the histogram produced by this command.

Trc     (%)  Frm  N  Term    (%)  Frame                                                                                                                                                                                  
699506  87%  699506       0   0%  org.hibernate.internal.SessionImpl.autoFlushIfRequired(                                                                                                          
689370  85%  689370      10   0%  org.hibernate.internal.QueryImpl.list(                                                                                                                              
676524  84%  676524       0   0%  org.hibernate.event.internal.DefaultAutoFlushEventListener.onAutoFlush(                                                                          
675136  84%  675136       0   0%  org.hibernate.internal.SessionImpl.list(                                                                                                                         
573836  71%  573836       4   0%  org.hibernate.ejb.QueryImpl.getResultList(                                                                                                                          
550968  68%  550968       1   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEverythingToExecutions(                                                          
533892  66%  533892     132   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEntities(                                                                       
381514  47%  381514     882   0%  org.hibernate.event.internal.AbstractVisitor.processEntityPropertyValues(                                                                                      
271018  33%  271018       0   0%  org.hibernate.event.internal.DefaultFlushEntityEventListener.onFlushEntity(
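The trimming behind this histogram is a one-pass scan per trace: drop everything above the first frame matching the target filter, then count the surviving frames. A toy version of that step (helper and parameter names are mine, not SJK's):

```java
import java.util.Arrays;

class TraceTrim {
    // Drop all frames above the first frame matching the prefix, so histogram
    // counts are computed relative to the framework entry point.
    // The trace is ordered from root (outermost frame) to tip.
    static String[] trimTo(String[] rootFirstTrace, String prefix) {
        for (int i = 0; i < rootFirstTrace.length; i++) {
            if (rootFirstTrace[i].startsWith(prefix)) {
                return Arrays.copyOfRange(rootFirstTrace, i, rootFirstTrace.length);
            }
        }
        return new String[0]; // no matching frame: trace is filtered out
    }
}
```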

Here is our suspect. We spend 87% of Hibernate time in the autoFlushIfRequired() call (and JDBC time is already excluded).

Using a few commands, we have narrowed down one performance bottleneck. Fixing it is another topic, though.

In the case I'm using as an example, CPU usage of the application was reduced tenfold. A few of the problems found and addressed during that case were

  • optimization of Hibernate usage
  • facelets compilation caching was properly configured
  • a workaround for a performance bug in the Seam framework was implemented
  • JSF layouts were optimized to reduce the number of Seam injections / outjections

Limitations of this approach

During statistical analysis of stack traces you deal with wall-clock time; you cannot guess real CPU time using this method. If the CPU on the host is saturated, your numbers will be skewed by the threads' idle time due to CPU starvation.

Normally you can get a stack trace only at JVM safepoints. So if some methods are inlined by the JIT compiler, they may never appear in a trace even if they are really busy. In other words, the tip of the stack trace may be skewed by JIT effects. In practice it has never been an obstacle for me, but you should keep in mind the possibility of such an effect.

What about flame graphs?

Well, despite being not so useful, they look good in presentations. Support for flame graphs was recently added to SJK.

Monday, October 12, 2015

Does Linux hate Java?

Recently, I discovered a fancy bug affecting a few versions of the Linux kernel. Without any warning, the JVM just hangs in a GC pause forever. The root cause is an improper memory access in kernel code. This post by Gil Tene gives a good technical explanation with deep emotional coloring.

While this bug is not JVM specific, you will find few other processes as heavily multithreaded as the JVM on a typical Linux box.

This recent bug made me remember a few other cases where Linux has screwed Java badly.

Transparent huge pages

The transparent huge pages feature was introduced in version 2.6.38 of the kernel. While it was intended to improve performance, a lot of people report negative effects related to this feature, especially for memory-intensive processes such as JVMs and some database engines.

  • Oracle - Performance Issues with Transparent Huge Pages
  • Transparent Huge Pages and Hadoop workloads
  • Why TokuDB Hates Transparent Huge Pages
Leap seconds bug

The famous leap second bug in Linux produced a whole plague across data centers in 2012. Java and MySQL were affected most badly. What is common between Java and MySQL? Both use threads extensively.

So, Linux, could you be a little more gentle with Java, please ;)

Tuesday, August 4, 2015

SJK - missing link in Java profiling tool chain

Sometimes it just happens. You have a bloated Java application on your hands and it does not perform well. You may have built this application yourself or just got it as it is. It doesn't matter; the thing is, you do not have the slightest idea what is wrong here.

The Java ecosystem has an abundance of diagnostic tools (thanks to the interfaces exposed by the JVM itself), but they are mostly focused on specific, narrow kinds of problems. Despite calling themselves intuitive, they assume you have a lot of background knowledge about the JVM and profiling techniques. Honestly, even a seasoned Java developer (I'm speaking for myself here) can feel lost the first time they look at JProfiler, YourKit or Mission Control.

If you have a performance problem on your hands, the first thing you need is to classify the problem: is it in Java, the database, or somewhere else? Is it a CPU or a memory kind of problem? Once you know what kind of problem you have, you can choose the next diagnostic approach consciously.

Are we CPU bound?

One of the first things you would naturally do is check the CPU usage of your process. The OS can show you process CPU usage, which is useful, but the next question is which threads are consuming it. The OS can show you per-thread usage too; you can even get OS IDs for your Java threads using jstack and correlate them ... manually (sic).

A simple tool showing CPU usage per Java thread is the thing I had wanted badly for years.

Surprisingly, all the information is already in the JMX Threading MBean. All that is left is to do trivial math and report per-thread CPU usage. So I just did it, and the ttop command became the first in the SJK tool set.

Besides CPU usage, JMX has another invaluable metric: a per-thread allocation counter.
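The "trivial math" is a pair of counter deltas per thread: sample each thread's CPU time (and, on HotSpot, allocated bytes) twice and divide by the wall-clock interval. A minimal sketch, assuming a HotSpot JVM where the platform MBean implements com.sun.management.ThreadMXBean (this is an illustration, not the actual ttop implementation):

```java
import java.lang.management.ManagementFactory;

public class ThreadUsageSample {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot-specific interface: adds per-thread allocation counters
        com.sun.management.ThreadMXBean mx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

        long[] ids = mx.getAllThreadIds();
        long[] cpu0 = new long[ids.length];
        long[] alloc0 = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            cpu0[i] = mx.getThreadCpuTime(ids[i]);          // nanoseconds, -1 if thread died
            alloc0[i] = mx.getThreadAllocatedBytes(ids[i]); // bytes allocated so far
        }

        long intervalMs = 1000;
        Thread.sleep(intervalMs);

        for (int i = 0; i < ids.length; i++) {
            long cpu1 = mx.getThreadCpuTime(ids[i]);
            long alloc1 = mx.getThreadAllocatedBytes(ids[i]);
            if (cpu0[i] < 0 || cpu1 < 0) continue; // thread has died, skip it
            // CPU%: nanoseconds of CPU burned per nanosecond of wall clock
            double cpuPct = 100d * (cpu1 - cpu0[i]) / (intervalMs * 1_000_000d);
            double allocPerSec = (alloc1 - alloc0[i]) / (intervalMs / 1000d);
            System.out.printf("[%06d] cpu=%5.2f%% alloc=%.0f b/s%n", ids[i], cpuPct, allocPerSec);
        }
    }
}
```

ttop does the same kind of sampling continuously, for a remote process, and sorts the result.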

Collecting information from JMX is safe and can be done on a live application instance (in case you do not have a JMX port open, SJK can connect using the process ID).

Below is an example of ttop command output.

2014-10-01T19:27:22.825+0400 Process summary
  process cpu=101.80%
  application cpu=100.50% (user=86.21% sys=14.29%)
  other: cpu=1.30%
  GC cpu=0.00% (young=0.00%, old=0.00%)
  heap allocation rate 123mb/s
  safe point rate: 1.5 (events/s) avg. safe point pause: 0.14ms
  safe point sync time: 0.00% processing time: 0.02% (wallclock time)
[000037] user=83.66% sys=14.02% alloc=  121mb/s - Proxy:ExtendTcpProxyService1:TcpAcceptor:TcpProcessor
[000075] user= 0.97% sys= 0.08% alloc=  411kb/s - RMI TCP Connection(35)-
[000029] user= 0.61% sys=-0.00% alloc=  697kb/s - Invocation:Management
[000073] user= 0.49% sys=-0.01% alloc=  343kb/s - RMI TCP Connection(33)-
[000023] user= 0.24% sys=-0.01% alloc=   10kb/s - PacketPublisher
[000022] user= 0.00% sys= 0.10% alloc=   11kb/s - PacketReceiver
[000072] user= 0.00% sys= 0.07% alloc=   22kb/s - RMI TCP Connection(31)-
[000056] user= 0.00% sys= 0.05% alloc=   20kb/s - RMI TCP Connection(25)-
[000026] user= 0.12% sys=-0.07% alloc=  2217b/s - Cluster|Member(Id=18, Timestamp=2014-10-01 15:58:3 ...
[000076] user= 0.00% sys= 0.04% alloc=  6657b/s - JMX server connection timeout 76
[000021] user= 0.00% sys= 0.03% alloc=   526b/s - PacketListener1P
[000034] user= 0.00% sys= 0.02% alloc=  1537b/s - Proxy:ExtendTcpProxyService1
[000049] user= 0.00% sys= 0.02% alloc=  6011b/s - JMX server connection timeout 49
[000032] user= 0.00% sys= 0.01% alloc=     0b/s - DistributedCache

Besides CPU and allocation, it also collects "true" GC usage and safepoint statistics. The latter two metrics are not available via JMX, so they are available only for process-ID connections.

The CPU usage picture will give you good insight into what to do next: should you profile your Java hot spots, or is all the time spent waiting for results from the DB?

Garbage analysis

Another common class of Java problems is related to garbage collection. If this is the case, GC logs are the first place to look.

Do you have them enabled? If not, that is not a big deal; you can enable GC logging on a running JVM process using the jinfo command. You can also use SJK's gc command to peek at GC activity for your Java process (it is not as complete as GC logs, though).

If GC logs confirm that GC is causing your problems, the next step is to identify where that garbage comes from.

Commercial profilers are good at memory profiling, but this kind of analysis slows down the target application dramatically.

Mission Control stands out of the pack here: it can profile by sampling TLAB allocation failures. This technique is cheap and generally produces good results, though it is inherently biased and may mislead you sometimes.

For a long time, jmap and its class histogram were the main memory profiling instruments for me. A class histogram is simple and accurate.

In the SJK toolset, I have augmented the vanilla jmap command a little to make it more useful (SJK's hh command).

Beware that jmap (and thus the hh command) requires a stop-the-world pause on the target JVM while the heap is being walked, so it may not be a good idea to execute it against a live application under load.

A dead heap histogram is calculated as the difference between the object populations before and after a forced GC (using the jmap class histogram command under the hood).

A dead young heap histogram enforces a full GC, then waits 10 seconds (by default), then produces a dead object histogram by the technique described above. Thus you see a summary of freshly allocated garbage.
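The arithmetic behind the dead histogram is just a per-class subtraction of two instance-count snapshots; whatever disappeared in the forced GC was dead. A toy illustration of that step (names are mine, not SJK's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HistoDiff {
    // Subtract the "after forced GC" class histogram from the "before" one;
    // classes whose instance count dropped contribute the dead population
    static Map<String, Long> dead(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> dead = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : before.entrySet()) {
            long diff = e.getValue() - after.getOrDefault(e.getKey(), 0L);
            if (diff > 0) {
                dead.put(e.getKey(), diff);
            }
        }
        return dead;
    }
}
```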

This method cannot tell you where in your code that garbage was allocated (that is a job for Mission Control et al.). Though, if you know what your top garbage objects are, you may already know where they are allocated.

SJK has a few more tools, but these two, ttop and hh, are always on the front lines when I need to tackle another performance-related problem.