Guardian was introduced in Oracle Coherence 3.5 as a uniform and reliable means of detecting and reporting various stalls and hangs on data grid members. In addition to monitoring internal components of Coherence, Guardian has an API accessible to application developers.
While the out-of-the-box Guardian does its job pretty well, there are a few aspects you can improve.
There are three techniques for working with the Coherence Guardian. You can choose to employ all of them or just a few.
Guardian heartbeats
Guardian uses a heartbeat mechanism to detect thread stalls. Internally, Coherence code explicitly sends heartbeats at appropriate points. Application code can use the same technique if a long execution time is expected. CacheStores are a good example of this.
GuardSupport.heartbeat() – sends a normal heartbeat
GuardSupport.heartbeat(long) – allows you to pass the expected time until the next heartbeat (e.g. if you expect an SQL query to take several minutes, you can prevent a log warning by passing a reasonably long timeout before executing the SQL statement)
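As a rough illustration of where such calls might go, the sketch below processes a batch in chunks and signals liveness around each chunk. To keep it runnable without a Coherence jar on the classpath, the two heartbeat variants are injected as functional interfaces; in a real cache store you would presumably pass GuardSupport::heartbeat for both. The class name ChunkedWork, the chunk size and the per-chunk time estimate are all made up for this example.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;

/** Hypothetical helper: runs work chunk by chunk, heartbeating around each chunk. */
final class ChunkedWork<T> {
    private final Runnable heartbeat;         // stand-in for GuardSupport.heartbeat()
    private final LongConsumer heartbeatFor;  // stand-in for GuardSupport.heartbeat(long)
    private final int chunkSize;
    private final long expectedChunkMillis;

    ChunkedWork(Runnable heartbeat, LongConsumer heartbeatFor,
                int chunkSize, long expectedChunkMillis) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.heartbeat = heartbeat;
        this.heartbeatFor = heartbeatFor;
        this.chunkSize = chunkSize;
        this.expectedChunkMillis = expectedChunkMillis;
    }

    void process(List<T> items, Consumer<List<T>> chunkHandler) {
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            // Announce that the next heartbeat may be late (e.g. a slow SQL batch).
            heartbeatFor.accept(expectedChunkMillis);
            chunkHandler.accept(items.subList(from, to));
        }
        // Plain heartbeat once the slow part is over.
        heartbeat.run();
    }
}
```

The point of chunking is that each long timeout covers only one unit of work, so a genuinely hung chunk is still noticed within a bounded time.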
Implementing Guardable
Normally the guardian will try to "recover" a thread if no heartbeat has been received within the timeout (either the one specified in configuration or the one passed in the last heartbeat(...) call).
This behavior can be overridden, though.
An application can register its own Guardable and temporarily disable monitoring of the current thread.
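To make the timeout mechanics concrete, here is a toy model of a guarded context. It is not Coherence's implementation, only a sketch under stated assumptions: ToyGuardContext, its State enum and the recovery fraction are invented for illustration, and time is passed in explicitly so the behaviour is deterministic. A heartbeat moves both deadlines forward; missing the soft deadline asks for recovery, and missing the hard deadline asks for termination.

```java
/** Toy model only; names and the recovery fraction are assumptions, not Coherence internals. */
final class ToyGuardContext {
    enum State { HEALTHY, RECOVER, TERMINATE }

    private final long defaultTimeoutMillis;
    private final double recoverFraction; // share of the timeout after which recovery is requested
    private long softDeadline;
    private long hardDeadline;

    ToyGuardContext(long defaultTimeoutMillis, double recoverFraction, long nowMillis) {
        this.defaultTimeoutMillis = defaultTimeoutMillis;
        this.recoverFraction = recoverFraction;
        heartbeat(nowMillis);
    }

    /** Normal heartbeat: restart the clock with the default timeout. */
    void heartbeat(long nowMillis) {
        heartbeat(nowMillis, defaultTimeoutMillis);
    }

    /** Heartbeat announcing that the next one may take up to timeoutMillis. */
    void heartbeat(long nowMillis, long timeoutMillis) {
        softDeadline = nowMillis + (long) (timeoutMillis * recoverFraction);
        hardDeadline = nowMillis + timeoutMillis;
    }

    /** What a guardian polling at nowMillis would decide in this model. */
    State check(long nowMillis) {
        if (nowMillis >= hardDeadline) {
            return State.TERMINATE;
        }
        if (nowMillis >= softDeadline) {
            return State.RECOVER;
        }
        return State.HEALTHY;
    }
}
```

In this model a year-long heartbeat on the original context effectively switches it off, which is the trick a custom Guardable wrapper can use while its own context takes over.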
Below is a code snippet which wraps cache loader operations in a Guardable, preventing thread interruption (the default way to "recover" a worker thread).
// Imports for the enclosing source file.
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import com.tangosol.net.GuardSupport;
import com.tangosol.net.Guardian.GuardContext;
import com.tangosol.net.cache.CacheLoader;

public static class GuardianAwareCacheLoader implements CacheLoader {
    private final CacheLoader loader;

    public GuardianAwareCacheLoader(CacheLoader loader) {
        this.loader = loader;
    }

    @Override
    public Object load(Object key) {
        GuardContext ctx = GuardSupport.getThreadContext();
        if (ctx != null) {
            // guard this call with our own Guardable
            KeyLoaderGuard guard = new KeyLoaderGuard(Collections.singleton(key));
            GuardContext klg = ctx.getGuardian().guard(guard);
            GuardSupport.setThreadContext(klg);
            // disable current context (same as in loadAll)
            ctx.heartbeat(TimeUnit.DAYS.toMillis(365));
        }
        try {
            return loader.load(key);
        }
        finally {
            if (ctx != null) {
                // restore the original context and drop the temporary guard
                GuardContext klg = GuardSupport.getThreadContext();
                GuardSupport.setThreadContext(ctx);
                klg.release();
                // reenable current context
                ctx.heartbeat();
            }
        }
    }

    @Override
    @SuppressWarnings({ "rawtypes", "unchecked" })
    public Map loadAll(Collection keys) {
        GuardContext ctx = GuardSupport.getThreadContext();
        if (ctx != null) {
            KeyLoaderGuard guard = new KeyLoaderGuard(keys);
            GuardContext klg = ctx.getGuardian().guard(guard);
            GuardSupport.setThreadContext(klg);
            // disable current context
            ctx.heartbeat(TimeUnit.DAYS.toMillis(365));
        }
        try {
            return loader.loadAll(keys);
        }
        finally {
            if (ctx != null) {
                GuardContext klg = GuardSupport.getThreadContext();
                GuardSupport.setThreadContext(ctx);
                klg.release();
                // reenable current context
                ctx.heartbeat();
            }
        }
    }
}
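// Imports for the enclosing source file.
import java.util.Collection;
import com.tangosol.net.Guardable;
import com.tangosol.net.Guardian.GuardContext;

public static class KeyLoaderGuard implements Guardable {
    // keys being loaded; they show up in guardian log messages via toString()
    final Collection<Object> keys;
    GuardContext context;

    public KeyLoaderGuard(Collection<Object> keys) {
        this.keys = keys;
    }

    @Override
    public GuardContext getContext() {
        return context;
    }

    @Override
    public void setContext(GuardContext context) {
        this.context = context;
    }

    @Override
    public void recover() {
        // soft timeout: report it and keep going instead of interrupting the thread
        System.out.println("got RECOVER signal");
        context.heartbeat();
    }

    @Override
    public void terminate() {
        // hard timeout
        System.out.println("got TERMINATE signal");
    }

    @Override
    public String toString() {
        return "KeyLoaderGuard:" + keys;
    }
}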
Using a custom Guardable provides the following advantages:
- Additional context information is available and is logged for the custom Guardable (e.g. the SQL statement causing problems).
- Custom code can choose how to react to a timeout. You can choose to continue, or try to cancel the request somehow (e.g. by closing the JDBC connection).
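The second point can be sketched as follows. The class below mirrors the recover()/terminate() pair of a Guardable but deliberately does not implement the Coherence interface, so that it compiles on its own; CancellingGuard and its cancel hook are hypothetical names. On the soft timeout it tries to cancel the stalled request exactly once (for JDBC this might be Statement.cancel() or closing the connection) instead of simply heartbeating and carrying on.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; a real version would implement com.tangosol.net.Guardable. */
final class CancellingGuard {
    private final String description;    // e.g. the SQL text, visible in logs via toString()
    private final Runnable cancelAction;  // e.g. a wrapper around Statement.cancel()
    private final AtomicBoolean cancelRequested = new AtomicBoolean();

    CancellingGuard(String description, Runnable cancelAction) {
        this.description = description;
        this.cancelAction = cancelAction;
    }

    /** Soft timeout: ask the stalled operation to stop, but only once. */
    public void recover() {
        if (cancelRequested.compareAndSet(false, true)) {
            try {
                cancelAction.run();
            } catch (RuntimeException e) {
                // Cancellation is best effort; swallow failures so recover() never throws.
                System.err.println("cancel failed for " + this + ": " + e);
            }
        }
    }

    /** Hard timeout: nothing more to try in this sketch. */
    public void terminate() {
        System.err.println("giving up on " + this);
    }

    boolean isCancelRequested() {
        return cancelRequested.get();
    }

    @Override
    public String toString() {
        return "CancellingGuard:" + description;
    }
}
```

Whether cancelling is better than waiting depends on the data source; the flag only makes sure a repeated RECOVER signal does not fire the cancel action twice.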
Custom service failure policy
The service failure policy is responsible for reacting to guardian timeouts and critical service failures. The reaction is configurable, but for standalone Coherence processes I prefer to override this policy.
Below is an example of a service failure policy which I find more reasonable for dedicated Coherence nodes.
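// Coherence imports. Logger and LogManager come from the logging library in use
// (the calls match log4j); ThreadUtil is a helper that is not shown here.
import com.tangosol.net.Cluster;
import com.tangosol.net.Guardable;
import com.tangosol.net.Service;
import com.tangosol.net.ServiceFailurePolicy;

public class ServiceFailureHandler implements ServiceFailurePolicy {
    private static final Logger LOGGER = LogManager.getLogger(ServiceFailureHandler.class);

    @Override
    public void onGuardableRecovery(Guardable guardable, Service service) {
        // soft timeout: log one line and delegate to the Guardable
        LOGGER.warn("Soft timeout detected. Service: " + service.getInfo().getServiceName() + " Task: " + guardable);
        guardable.recover();
    }

    @Override
    public void onGuardableTerminate(Guardable guardable, Service service) {
        LOGGER.error("Hard timeout detected. Service: " + service.getInfo().getServiceName()
                + " Task: " + guardable + ". Node will be terminated.");
        halt();
    }

    @Override
    public void onServiceFailed(Cluster cluster) {
        LOGGER.error("Service failure detected. Node will be terminated.");
        halt();
    }

    private static void halt() {
        try {
            // dump threads and flush logs while we still can
            ThreadUtil.logThreadDump(LOGGER);
            LogManager.shutdown();
            System.out.flush();
            System.err.flush();
        } finally {
            // halt() skips shutdown hooks, so the process exits immediately
            Runtime.getRuntime().halt(1);
        }
    }
}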
Compared to the standard policy, it has the following advantages:
- In case of a service failure, the process is terminated quickly (without waiting for shutdown hooks etc.). In my case, the process is then restarted immediately by an external watchdog.
- "Soft timeouts" will not pollute the log with thread dumps. The only thread dump is logged just before the process terminates (which is especially important when implementing a custom Guardable).
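ThreadUtil.logThreadDump is a helper that is not shown in this post. A minimal stand-in, assuming nothing beyond the JDK, could build the dump from Thread.getAllStackTraces() and hand the resulting string to whatever logger is in use; ThreadDumps and format() are names made up for this sketch.

```java
import java.util.Map;

/** Hypothetical stand-in for the ThreadUtil helper used by the policy. */
final class ThreadDumps {
    private ThreadDumps() {
    }

    /** Renders the stack of every live thread as one multi-line string. */
    static String format() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(t.isDaemon() ? " daemon" : "")
              .append(" state=").append(t.getState())
              .append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Inside halt() this would be logged as a single message, for example LOGGER.error(ThreadDumps.format()), before the logging system is shut down.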
Conclusion
Integrating your application with the Coherence Guardian doesn't require much code, but it can make your logs clearer and troubleshooting less painful. While it will not make your application work faster, it can save hours of digging through logs.