Sunday, January 24, 2016

Flame Graphs Vs. Cold Numbers

Stack trace sampling is a very powerful technique for performance troubleshooting. The advantages of stack trace sampling are:

  • it doesn't require upfront configuration
  • cost added by sampling is small and controllable
  • it is easy to compare analysis results from different experiments

Unfortunately, the tools for stack trace analysis offered by popular Java profilers are very limited.

Solving performance problems in complex applications (lots of business logic, etc.) is one of my regular challenges. Let's assume I have another misbehaving application on my hands. The first step would be to localize the bottleneck to a specific part of the stack.

Meet call tree

A call tree is built by digesting a large number of stack traces. Each node in the tree has a frequency: the number of traces passing through that node.
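To make this concrete, below is a minimal sketch (not SJK's actual implementation) of how such a tree could be aggregated from samples; frames are simplified to plain strings.

import java.util.HashMap;
import java.util.Map;

// Sketch of a call tree aggregated from sampled stack traces.
// A real tool would keep class/method/line information per frame.
class CallNode {
    final String frame;
    long hits; // number of sampled traces passing through this node
    final Map<String, CallNode> children = new HashMap<>();

    CallNode(String frame) { this.frame = frame; }

    // Merge one trace into the tree; the trace is listed root-first
    // (e.g. Thread.run at index 0).
    void add(String[] trace, int depth) {
        hits++;
        if (depth < trace.length) {
            children.computeIfAbsent(trace[depth], CallNode::new)
                    .add(trace, depth + 1);
        }
    }
}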

Usually tools allow you to navigate through a call tree reconstructed from the stack trace population.

There is also the flame graph visualization (shown at the top right of the page), which is fancier but is just the same tree.

Looking at these visualizations, what can I see? Not too much.

Why? Business logic somewhere in the middle of the call tree produces too many branches. The tree beneath the business logic is blurred beyond the point of usability.

Dissecting call tree

The application is built using frameworks. For the sake of this article, I'm using an example based on JBoss, JSF, Seam, and Hibernate.

Now, if 13% of the traces in our dump contain JDBC frames, we can conclude that 13% of the time is spent in JDBC / database calls.
13% is a reasonable number, so the database is not to blame here.

Let's go down the stack; Hibernate is the next layer. Now we need to count all traces containing Hibernate classes, excluding the ones that contain JDBC. This way we can attribute traces to a particular framework and quickly get a picture of where time is spent at runtime.
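The idea boils down to first-match categorization with an ordered list of filters. Here is a rough sketch (not SJK's actual matcher; the package predicates are simplified):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of first-match categorization. Category order matters:
// "JDBC" is checked before "Hibernate", so a trace that reaches the
// JDBC driver is not also counted as Hibernate.
public class TraceCategorizer {

    static final Map<String, Predicate<StackTraceElement>> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("JDBC", f -> f.getClassName().contains(".jdbc."));
        CATEGORIES.put("Hibernate", f -> f.getClassName().startsWith("org.hibernate."));
    }

    static String categorize(StackTraceElement[] trace) {
        for (Map.Entry<String, Predicate<StackTraceElement>> e : CATEGORIES.entrySet()) {
            for (StackTraceElement frame : trace) {
                if (e.getValue().test(frame)) {
                    return e.getKey();
                }
            }
        }
        return "Other";
    }
}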

I didn't find any tool that could do this kind of analysis for me, so I built one for myself a few years ago. SJK is my universal Java troubleshooting toolkit.
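(The dump itself can also be captured with SJK: the stcap command samples thread dumps from a running JVM into a file. The PID below is just a placeholder.)

sjk stcap -p 1234 -o tracedump.std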

Below is the command doing the analysis explained above.

sjk ssa -f tracedump.std  --categorize -tf **.CoyoteAdapter.service -nc
JDBC=**.jdbc 
Hibernate=org.hibernate
"Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile"
"Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"
JSF.execute=com.sun.faces.lifecycle.LifecycleImpl.execute
JSF.render=com.sun.faces.lifecycle.LifecycleImpl.render
Other=**

Below is the output of this command.

Total samples    2732050 100.00%
JDBC              405439  14.84%
Hibernate         802932  29.39%
Facelets compile  395784  14.49%
Seam bijection    385491  14.11%
JSF.execute       290355  10.63%
JSF.render        297868  10.90%
Other             154181   5.64%

Well, we clearly see a large amount of time spent in Hibernate. This is very wrong, so it is the first candidate for investigation. We also see that a lot of CPU is spent on JSF compilation, though pages should be compiled just once and cached (it turned out to be a configuration issue). The actual application logic falls into the JSF lifecycle calls (execute(), render()). It would be possible to introduce an additional category to isolate pure application logic execution time, but looking at the numbers, I would say it is not necessary until the other problems are solved.

Hibernate is our primary suspect, but how do we look inside it? Let's look at the method histogram for traces attributed to Hibernate, trimming away all frames up to the first Hibernate method call.
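Conceptually, the trimming works like the sketch below (a simplification, not SJK's actual code); it assumes traces are listed root-first.

import java.util.Arrays;

// Keep only the part of each trace starting from the first frame of
// the given package; traces that never touch the package are dropped.
public class TraceTrimmer {
    static StackTraceElement[] trimToFirst(StackTraceElement[] trace, String packagePrefix) {
        for (int i = 0; i < trace.length; i++) {
            if (trace[i].getClassName().startsWith(packagePrefix)) {
                return Arrays.copyOfRange(trace, i, trace.length);
            }
        }
        return null; // the trace does not touch that package at all
    }
}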

Below is the SJK command to do this.

sjk ssa -f tracedump.std --histo -tf **!**.jdbc -tt org.hibernate

Here is the top of the histogram produced by this command.

Trc     (%)  Frm  N  Term    (%)  Frame                                                                                                                                                                                  
699506  87%  699506       0   0%  org.hibernate.internal.SessionImpl.autoFlushIfRequired(SessionImpl.java:1204)                                                                                                          
689370  85%  689370      10   0%  org.hibernate.internal.QueryImpl.list(QueryImpl.java:101)                                                                                                                              
676524  84%  676524       0   0%  org.hibernate.event.internal.DefaultAutoFlushEventListener.onAutoFlush(DefaultAutoFlushEventListener.java:58)                                                                          
675136  84%  675136       0   0%  org.hibernate.internal.SessionImpl.list(SessionImpl.java:1261)                                                                                                                         
573836  71%  573836       4   0%  org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:264)                                                                                                                          
550968  68%  550968       1   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:99)                                                          
533892  66%  533892     132   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEntities(AbstractFlushingEventListener.java:227)                                                                       
381514  47%  381514     882   0%  org.hibernate.event.internal.AbstractVisitor.processEntityPropertyValues(AbstractVisitor.java:76)                                                                                      
271018  33%  271018       0   0%  org.hibernate.event.internal.DefaultFlushEntityEventListener.onFlushEntity(DefaultFlushEntityEventListener.java:161)

Here is our suspect: 87% of Hibernate time is spent in the autoFlushIfRequired() call (and JDBC time is already excluded).

Using a few commands, we have narrowed down one performance bottleneck. Fixing it is another topic, though.
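As an aside, one common direction for taming excessive auto-flush (shown purely for illustration; not necessarily the fix applied in this case) is to flush only at commit on read-heavy code paths:

import org.hibernate.FlushMode;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

// Illustrative sketch: flushing at commit instead of before every
// query removes most of the autoFlushIfRequired() work for code that
// mostly reads. Whether this is safe depends on the application.
public class FlushModeExample {
    static Session readMostlySession(SessionFactory factory) {
        Session session = factory.getCurrentSession();
        session.setFlushMode(FlushMode.COMMIT); // Hibernate 3.x/4.x API
        return session;
    }
}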

In the case I'm using as an example, the CPU usage of the application was reduced by a factor of 10. A few of the problems found and addressed during that case were:

  • optimization of Hibernate usage
  • Facelets compilation caching was properly configured
  • a workaround for a performance bug in the Seam framework was implemented
  • JSF layouts were optimized to reduce the number of Seam injections / outjections

Limitations of this approach

During statistical analysis of stack traces you deal with wall-clock time; you cannot estimate real CPU time using this method. If the CPU on the host is saturated, your numbers will be skewed by thread idle time due to CPU starvation.

Normally you can get a stack trace only at JVM safepoints, so if some methods are inlined by the JIT compiler, they may never appear in a trace even if they are really busy. In other words, the tip of the stack trace may be skewed by JIT effects. In practice, this was never an obstacle for me, but you should keep in mind the possibility of such an effect.

What about flame graphs?

Well, despite not being that useful, they look good in presentations. Support for flame graphs was added to SJK recently.

Update

After some time, I've found myself using flame graphs very actively. Yes, for certain situations this type of visualization doesn't make sense, but as a first bird's-eye look at a problem, flame graphs are indispensable.