Thursday, November 14, 2013

Coherence 101 - Soothing the Guardian

Guardian was introduced in Oracle Coherence 3.5 as a uniform and reliable means to detect and report various stalls and hangs on data grid members. In addition to monitoring internal components of Coherence, Guardian has an API accessible to application developers.

While the out-of-the-box Guardian does its job pretty well, there are a few aspects you can improve.

There are 3 techniques for working with the Coherence Guardian. You can choose to employ all of them or just a few.

Guardian heartbeats

Guardian uses a heartbeat mechanism to detect thread stalls. Internally, Coherence code explicitly issues heartbeats at appropriate points. Application code can use the same technique if a long execution time is expected. CacheStores are a good example of this.

  • GuardSupport.heartbeat() – sends a normal heartbeat
  • GuardSupport.heartbeat(long) – lets you pass the expected time until the next heartbeat (e.g. if you expect an SQL query to take several minutes, you can prevent a log warning by passing a reasonably long timeout before executing the SQL statement)
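
As a minimal sketch of the second call, the helper below announces an expected duration right before running a slow operation. To keep the example runnable without a Coherence jar, the heartbeat call is injected as a LongConsumer; on a real grid member you would pass GuardSupport::heartbeat. The class name SlowCallHeartbeat and the 30 second safety margin are my assumptions for illustration, not part of Coherence.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical helper: tells the guardian how long a slow call may take before starting it. */
final class SlowCallHeartbeat {

    // Assumed safety margin added on top of the caller's estimate.
    private static final long MARGIN_MILLIS = 30_000L;

    // On a Coherence node this would be GuardSupport::heartbeat (the long overload).
    private final LongConsumer heartbeat;

    SlowCallHeartbeat(LongConsumer heartbeat) {
        this.heartbeat = heartbeat;
    }

    /** Issues one heartbeat covering expectedMillis plus the margin, then runs the call. */
    <T> T call(long expectedMillis, Supplier<T> slowCall) {
        heartbeat.accept(expectedMillis + MARGIN_MILLIS);
        return slowCall.get();
    }
}
```

Inside a CacheStore this could be used as new SlowCallHeartbeat(GuardSupport::heartbeat).call(TimeUnit.MINUTES.toMillis(5), () -> runQuery(key)), where runQuery stands for your own data access code.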

Implementing Guardable

Normally the guardian tries to "recover" a thread if no heartbeat was received within the timeout (either the one specified in configuration or the one passed to the last heartbeat(...) call).
This behavior can be overridden, though. An application can register its own Guardable and temporarily disable monitoring of the current thread. Below is a code snippet that wraps cache loader operations in a Guardable, preventing thread interruption (the default way to "recover" a worker thread).

public static class GuardianAwareCacheLoader implements CacheLoader {

    private CacheLoader loader;

    public GuardianAwareCacheLoader(CacheLoader loader) {
        this.loader = loader;
    }

    @Override
    public Object load(Object key) {
        GuardContext ctx = GuardSupport.getThreadContext();
        if (ctx != null) {
            KeyLoaderGuard guard = new KeyLoaderGuard(Collections.singleton(key));
            GuardContext klg = ctx.getGuardian().guard(guard);
            GuardSupport.setThreadContext(klg);
            // disable current context, same as in loadAll(...)
            ctx.heartbeat(TimeUnit.DAYS.toMillis(365));
        }
        try {
            return loader.load(key);
        }
        finally {
            if (ctx != null) {
                GuardContext klg = GuardSupport.getThreadContext();
                GuardSupport.setThreadContext(ctx);
                klg.release();
                // reenable current context
                ctx.heartbeat();
            }
        }
    }

    @Override
    @SuppressWarnings({ "rawtypes", "unchecked" })
    public Map loadAll(Collection keys) {
        GuardContext ctx = GuardSupport.getThreadContext();
        if (ctx != null) {
            KeyLoaderGuard guard = new KeyLoaderGuard(keys);
            GuardContext klg = ctx.getGuardian().guard(guard); 
            GuardSupport.setThreadContext(klg);
            // disable current context
            ctx.heartbeat(TimeUnit.DAYS.toMillis(365));
        }
        try {
            return loader.loadAll(keys);
        }
        finally {
            if (ctx != null) {
                GuardContext klg = GuardSupport.getThreadContext();
                GuardSupport.setThreadContext(ctx);
                klg.release();
                // reenable current context
                ctx.heartbeat();
            }
        }
    }
}

public static class KeyLoaderGuard implements Guardable {

    Collection<Object> keys;
    GuardContext context;

    public KeyLoaderGuard(Collection<Object> keys) {
        this.keys = keys;
    }

    @Override
    public GuardContext getContext() {
        return context;
    }

    @Override
    public void setContext(GuardContext context) {
        this.context = context;
    }

    @Override
    public void recover() {
        System.out.println("got RECOVER signal");
        context.heartbeat();
    }

    @Override
    public void terminate() {
        System.out.println("got TERMINATE signal");
    }

    @Override
    public String toString() {
        return "KeyLoaderGuard:" + keys;
    }
}

Using a custom Guardable provides the following advantages:

  • Additional context information is available and is logged for the custom Guardable (e.g. the SQL statement causing problems).
  • Custom code can choose how to react to a timeout. You can choose to continue or try to cancel the request somehow (e.g. by closing the JDBC connection).
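
The second point can be sketched without Coherence on the classpath. The class below mirrors only the recover() and terminate() callbacks of Guardable; a real implementation would declare "implements Guardable" and also provide getContext()/setContext(...), as KeyLoaderGuard does. CancellingGuard and the injected cancel action are hypothetical names of mine: the guard tries to cancel the request exactly once, and for JDBC the action might cancel the statement or close the connection.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard: cancels the guarded request once when a timeout signal arrives. */
final class CancellingGuard {

    private final String description;    // context shown in logs, e.g. the SQL text
    private final Runnable cancelAction;  // assumed hook, e.g. cancels a JDBC statement
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    CancellingGuard(String description, Runnable cancelAction) {
        this.description = description;
        this.cancelAction = cancelAction;
    }

    /** Same role as Guardable.recover(): soft timeout, try to unblock the worker. */
    public void recover() {
        cancelOnce();
    }

    /** Same role as Guardable.terminate(): hard timeout, last chance to cancel. */
    public void terminate() {
        cancelOnce();
    }

    boolean isCancelled() {
        return cancelled.get();
    }

    private void cancelOnce() {
        // compareAndSet lets the action run at most once, even if recover()
        // and terminate() are invoked from different threads.
        if (cancelled.compareAndSet(false, true)) {
            cancelAction.run();
        }
    }

    @Override
    public String toString() {
        return "CancellingGuard:" + description;
    }
}
```

Whether cancelling on the soft timeout is right depends on the request; a guard that prefers to keep waiting could instead call context.heartbeat() in recover(), as KeyLoaderGuard does, and cancel only in terminate().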

Custom service failure policy

The service failure policy is responsible for reacting to guardian timeouts and critical service failures. The reaction is configurable, but for standalone Coherence processes I prefer to override this policy.

Below is an example of a service failure policy that I find more reasonable for dedicated Coherence nodes.

public class ServiceFailureHandler implements ServiceFailurePolicy {

    private final static Logger LOGGER = LogManager.getLogger(ServiceFailureHandler.class);

    @Override
    public void onGuardableRecovery(Guardable guardable, Service service) {
        LOGGER.warn("Soft timeout detected. Service: " + service.getInfo().getServiceName() + " Task: " + guardable);
        guardable.recover();
    }

    @Override
    public void onGuardableTerminate(Guardable guardable, Service service) {
        LOGGER.error("Hard timeout detected. Service: " + service.getInfo().getServiceName()
                     + " Task: " + guardable + ". Node will be terminated.");
        halt();
    }

    @Override
    public void onServiceFailed(Cluster cluster) {
        LOGGER.error("Service failure detected. Node will be terminated.");
        halt();
    }

    private static void halt() {
        try {
            ThreadUtil.logThreadDump(LOGGER);
            LogManager.shutdown();
            System.out.flush();
            System.err.flush();
        } finally {
            Runtime.getRuntime().halt(1);
        }
    }
}

Compared to the standard policy, it has the following advantages:

  • In case of a service failure, the process is terminated quickly (without waiting for shutdown hooks etc.). In my case, the process is then restarted immediately by an external watchdog.
  • "Soft timeouts" do not pollute the log with thread dumps. The only thread dump is logged just before termination of the process (which is especially important when implementing a custom Guardable).

Conclusion

Integrating your application with the Coherence Guardian doesn't require much code, but it can make your logs clearer and troubleshooting less painful. While it will not make your application run faster, it can save hours of digging through logs.

Wednesday, November 6, 2013

HotSpot JVM garbage collection options cheat sheet (v3)

Two years ago I published a cheat sheet for garbage collection options in the HotSpot JVM.

Recently I decided to give that work a refresh, and today I'm publishing the first HotSpot JVM options ref card, covering generic GC options and CMS tuning. (G1 has gained plenty of tuning options during the last two years, so it will get a dedicated ref card.)

Content-wise, GC log rotation options have been added and a few esoteric CMS diagnostic options have been removed.

Two page PDF version

Single page PDF version