Monday, March 11, 2019

Lies, darn lies and sampling bias

Sampling profiling is very powerful technique widely used across various platforms for identifying hot code (execution bottlenecks).

In Java world sampling profiling (thread stack sampling to be precise) is supported by every serious profiler.

While being powerful and very handy in practice, sampling has well known weakness – sampling bias. It is real and well-known problem, though its practical impact is often being over exaggerated.

A picture is worth a thousand of words, so let me jump start with example.

Case 1

Below is a simple snippet of code. This snippet is doing cryptographic hash calculation over a bunch of random strings.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.TimeUnit;

public class CryptoBench {

 private static final boolean trackTime = Boolean.getBoolean("trackTime");
 
 public static void main(String[] args) {
  CryptoBench test = new CryptoBench();
  while(true) {
   test.execute();
  }  
 }
 
 public void execute() {
        long N = 5 * 1000 * 1000;
        RandomStringUtils randomStringUtils = new RandomStringUtils();
        long ts = 0,tf = 0;
        long timer1 = 0;
        long timer2 = 0;
        long bs = System.nanoTime();
        for (long i = 0; i < N; i++) {
         ts = trackTime ? System.nanoTime() : 0;
            String text = randomStringUtils.generate();
            tf = trackTime ? System.nanoTime() : 0;
            timer1 += tf - ts;
            ts = tf;
   crypt(text);
   tf = trackTime ? System.nanoTime() : 0;
   timer2 += tf - ts;
   ts = tf;
        }
        long bt = System.nanoTime() - bs;
        System.out.print(String.format("Hash rate: %.2f Mm/s", 0.01 * (N * TimeUnit.SECONDS.toNanos(1) / bt / 10000)));
        if (trackTime) {
         System.out.print(String.format(" | Generation: %.1f %%",  0.1 * (1000 * timer1 / (timer1 + timer2))));
         System.out.print(String.format(" | Hashing: %.1f %%", 0.1 * (1000 * timer2 / (timer1 + timer2))));
        }
        System.out.println();
 }

    public String crypt(String str) {
        if (str == null || str.length() == 0) {
            throw new IllegalArgumentException("String to encrypt cannot be null or zero length");
        }
        StringBuilder hexString = new StringBuilder();
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(str.getBytes());
            byte[] hash = md.digest();
            for (byte aHash : hash) {
                if ((0xff & aHash) < 0x10) {
                    hexString.append("0" + Integer.toHexString((0xFF & aHash)));
                } else {
                    hexString.append(Integer.toHexString(0xFF & aHash));
                }
            }
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        return hexString.toString();
    }
}
code is available on github

Now let’s use a Visual VM (a profiler bundled with Java 8) and look how much time is actually spent in CryptoBench.crypt() method.

Something in definitely off in screenshot above!

CryptoBench.crypt(), method doing actual cryptography, is attributed only to 33% of execution time.
At same time, CryptoBench.execute() has 67% of self time, and that methods is doing nothing besides calling other methods.

Probably I just need a cooler profiler here. /s

Let’s use Java Flight Recorder for the very same case.
Below is screen shot from Mission Control. 


That looks much better!

CryptoBench.crypt() is now 86% of time our budget. Rest of time code spends in random string generation.
These numbers are looking more belivable to me.

Wait, wait, wait!

Integer.toHexString() is taking as much time as actual MD5 calculation. I cannot belive that.

Numbers are better than ones produced by VisualVM but they are still fishy enough.

Flight recorder is not cool enough for that task! We need really cool profiler! /s

Ok, let me bring some sense into this discrepancy between tools.

We were using thread stack sampling in both tools (Visual VM and Flight Recorder). Though, these tools capture stack traces differently.

Visual VM is actually sampling thread dumps (via thread dump support in JVM). Thread dumps include stack traces for every application thread in JVM, regardless of whatever thread's state is (blocked, sleeping or actually executing code) and this dump is taken atomically. It reflects instant execution state of whole JVM (which is important for deadlock/contention analysis). In practice, that implies short Stop the World pause for each dump. Stop the World pause means safepoint in hotspot JVM. And safepoints brings some nuances.

When Visual VM requests thread dump, JVM notifies threads to suspend execution, but a thread executing Java code wouldn’t stop immediately (unless it is interpreted). The thread would continue to run until next safepoint check where it can suspend itself. Checks cost CPU cycles so they are sparse in JIT generated code.

Checks are placed inside of loops and after method returns. Though, checks are omitted for loops considered “fast” by JIT compiler (typically integer indexed loops). Small methods are aggressively inlined too, hence omiting safepoint check at return. As a consequence, a hot and calculation intensive code may be optimized by JIT into single chunk of machine code which is mostly free of safepoint checks.

If you are lucky, thread dump would show you a line invoking the method containing hot code. With less luck result would be even more misleading.

So in Visual VM call tree we see method CryptoBench.execute() at top of the stack for 66% of samples. If we would be able to see call tree at line number granularity is would be a line calling CryptoBench.crypt() method.

Bad, ugly safepoint bias I’ve caught you red handed! /s

So, how Flight Recorder does sample stacks and why numbers are different?

Flight Recorder sampling doesn’t involve full thread dumps. Instead it freezes threads one by one using OS provided facilities. Once thread is frozen; we can get address of next instruction to be executed out of stack memory area. Address of instruction is converted into line number of java source code via byte code to machine code symbol map. The map is generated during JIT compilation. This is how stack trace is reconstructed.

In case of Flight Recorder safepoint bias does not apply. Though results are still looking inaccurate. Why?

Below is another session with Flight Recorder for the very same code.

 

Picture is different now.

Integer.toHexString() is just 2.25% of out execution budget which is more trustworthy in my eyes.

Flight Recoder has to resolve memory addresses back to reference of bytecode instruction (which is further transalted into Java source line). Mapping generated by JIT compiler is used for that purpose.

Though compiler is aware that we can see thread stack trace only at safepoints. By default, only safepoint checks are mapped into bytecode instruction indexes. Flight Recorder takes execution address from stack, then it finds next address mapped to Java code in symbol table. In case of aggressive inlining, Flight Recorder can map address to whole wrong point in code.

Though sampling itself is not biased by safepoints, symbol map generated by JIT compiler is.

In second example, I’ve used two JVM options to force more detailed symbol maps to be generated by JIT compler. Options are below.

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

More accurate, free of bias, symbol map allows Flight Recorder to produce more accurate stack traces.

In our mental model, code is being executed line by line (bytecode instruction by instruction). But complier lumps bunch of methods together and generates single blob of machine code, aggressively reordering operations in the middle of process to make code faster.
Our mental model of line by line execution is totally broken by compiler optimization.

Though, in practice artifacts of operation reordering are not that striking as safepoint bias.

So Java Flight Recorder is cool, Visual VM is not. Should I make this conclusion?

Let me present a counter example.

Case 2

Below is profiling reports from a differnt case.

Now I’m using flame graph generated from data captured by Visual VM and Flight Recorder (with –XX:+DebugNonSafepoints).

Visual VM report 

 

Flight Recorder report 

Both graphs are showing InflaterInputStream to be a bottleneck. Though Visual VM assesses time spent as 98%, but in Flight Recorder it is just 47%.

Who is right?

Correct answer is 92% (which is approximated using differential analysis).

My heart is broken! Flight Recorder is not a silver bullet. /s

What have gone wrong?

In this example, hot spot was related to JNI overhead involved with calling native code in zlib. It seems like Flight Recorder were unable reconstruct stack trace for certain samples outside of Java code and dropped these samples. Sample population was biased by native code execution. That bias has played against Flight Recorder in this case.

Conclusion

Both profilers are doing that they intended to do. Some sort of bias is natural for almost any kind of sampling.

Each sampling profiler could be categorized by three aspects.

  • Blind spots bias – which samples are excluded from data set collected by profiler.
  • Attractor bias – how samples be attracted to specific discrete points (e.g. safe point).
  • Resolution – unit of code which profiling data is being aggregated to (e.g. method, line number etc).

Below is summary table for sampling methods mentioned in this article.

Blind spot Attractor Resolution
JVM Thread Dump Sampling non-java threads safepoint bias java frames only
Java Flight Recorder non-java code execution CPU pipeline bias
+ code to source mapping skew
java frames only
Java Flight Recorder
+ DebugNonSafepoint
non-java code execution CPU pipeline bias
+ code to source mapping skew
java frames only