有个hadoop任务,使用了多线程,且在主线程里面设置了未捕获异常的处理:
[mw_shl_code=java,true]Thread.setDefaultUncaughtExceptionHandler(new InstantExceptionHandler());
...
public class InstantExceptionHandler implements UncaughtExceptionHandler {
private static final Log logger = LogFactory.getLog(InstantExceptionHandler.class);
@Override
public void uncaughtException(Thread t, Throwable e) {
// do something
System.exit(1);
}
}[/mw_shl_code]
某天线上发现这个hadoop任务的一个task不干活了,表现为task log截止到某个时间点后不再有日志输出,且log里面能看到有几个线程报出了OOM异常。正常来说在处理未捕获异常的InstantExceptionHandler里会调用System.exit(1)终止整个程序,但是程序却没有退出。
jstack的信息如下:
[mw_shl_code=bash,true]......
"Thread-74" prio=10 tid=0x00007f66d9c8b000 nid=0x165d6 waiting for monitor entry [0x000000004530a000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock <0x00000007614b85a0> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(Runtime.java:109)
at java.lang.System.exit(System.java:962)
at com.xxx.InstantExceptionHandler.uncaughtException(InstantExceptionHandler.java:26)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1985)
.......
"Thread-87" prio=10 tid=0x00007f66d9ee4000 nid=0x165e3 waiting for monitor entry [0x0000000046017000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock <0x00000007614b85a0> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(Runtime.java:109)
at java.lang.System.exit(System.java:962)
at com.xxx.InstantExceptionHandler.uncaughtException(InstantExceptionHandler.java:26)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1985)
......
"Thread-101" prio=10 tid=0x00007f66d9eeb800 nid=0x165ed in Object.wait() [0x0000000046a21000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1280)
- locked <0x00000007614b8358> (a org.apache.hadoop.util.ShutdownHookManager$1)
at java.lang.Thread.join(Thread.java:1354)
at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks(Shutdown.java:123)
at java.lang.Shutdown.sequence(Shutdown.java:167)
at java.lang.Shutdown.exit(Shutdown.java:212)
- locked <0x00000007614b85a0> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(Runtime.java:109)
at java.lang.System.exit(System.java:962)
at com.xxx.InstantExceptionHandler.uncaughtException(InstantExceptionHandler.java:26)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1985)
......[/mw_shl_code]
从jstack log分析可以看到,Thread-74和Thread-87两个线程(还有很多个这样的线程,这里只截取了两个)都在等待0x00000007614b85a0这个锁,而0x00000007614b85a0锁被Thread-101持有。Thread-101里面运行的是InstantExceptionHandler,里面调用的System.exit在获取锁后使用join等待某个线程退出。通过后面查找资料(http://hllvm.group.iteye.com/group/topic/28988),发现Thread-101其实等待的就是Thread-74和Thread-87等类似线程退出,这里其实是发生了死锁。 jdk文档(http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#exit(int))上对System.exit的说明是:
终止当前正在运行的 Java 虚拟机。等效于调用:Runtime.getRuntime().exit(n)
jdk文档(http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#exit(int))对Runtime#exit的说明是:
Terminates the currently running Java virtual machine by initiating its shutdown sequence. This method never returns normally. The argument serves as a status code; by convention, a nonzero status code indicates abnormal termination.
The virtual machine’s shutdown sequence consists of two phases. In the first phase all registered shutdown hooks, if any, are started in some unspecified order and allowed to run concurrently until they finish. In the second phase all uninvoked finalizers are run if finalization-on-exit has been enabled. Once this is done the virtual machine halts.
If this method is invoked after the virtual machine has begun its shutdown sequence then if shutdown hooks are being run this method will block indefinitely. If shutdown hooks have already been run and on-exit finalization has been enabled then this method halts the virtual machine with the given status code if the status is nonzero; otherwise, it blocks indefinitely.
翻译过来就是:
通过启动虚拟机的关闭序列,终止当前正在运行的 Java 虚拟机。此方法从不正常返回。可以将变量作为一个状态码;根据惯例,非零的状态码表示非正常终止。
虚拟机的关闭序列包含两个阶段。在第一个阶段中,会以某种未指定的顺序启动所有已注册的关闭钩子 (hook)(如果有的话),并且允许它们同时运行直至结束。在第二个阶段中,如果已启用退出终结,则运行所有未调用的终结方法。一旦完成这个阶段,虚拟机就会暂停。
如果在虚拟机已开始其关闭序列后才调用此方法,那么若正在运行关闭钩子,则将无限期地阻塞此方法。如果已经运行完关闭钩子,并且已启用退出终结 (on-exit finalization),那么此方法将利用给定的状态码(如果状态码是非零值)暂停虚拟机;否则将无限期地阻塞虚拟机。
重点是这句:如果在虚拟机已开始其关闭序列后才调用此方法,那么若正在运行关闭钩子,则将无限期地阻塞此方法。从jstack里面也能看到shutdownHook的线程Thread-74和Thread-87都是BLOCKED状态。
下面是一个验证的小程序:
[mw_shl_code=java,true]public class TestShutDownHook {
public static void main(String[] args) {
Runtime.getRuntime().addShutdownHook(new Thread() {
public void run() {
System.out.println("call hook exit");
System.exit(0);
}
});
System.out.println("before exit");
System.exit(0);
System.out.println("finish exit");
}
}
1
程序运行后会输出:
1
before exit
call hook exit[/mw_shl_code]
然后程序一直挂着无法退出。
jstack后可以看到:
[mw_shl_code=bash,true]"Thread-0" prio=5 tid=0x00007f8b2093f800 nid=0x4903 waiting for monitor entry [0x00000001179c6000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock <0x00000007eabd3840> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(Runtime.java:109)
at java.lang.System.exit(System.java:962)
at com.taobao.dump.java.TestShutDownHook$1.run(TestShutDownHook.java:8)
......
"main" prio=5 tid=0x00007f8b21800800 nid=0x1303 in Object.wait() [0x000000010e753000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000007eabd3550> (a com.taobao.dump.java.TestShutDownHook$1)
at java.lang.Thread.join(Thread.java:1280)
- locked <0x00000007eabd3550> (a com.taobao.dump.java.TestShutDownHook$1)
at java.lang.Thread.join(Thread.java:1354)
at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks(Shutdown.java:123)
at java.lang.Shutdown.sequence(Shutdown.java:167)
at java.lang.Shutdown.exit(Shutdown.java:212)
- locked <0x00000007eabd3840> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(Runtime.java:109)
at java.lang.System.exit(System.java:962)
at com.taobao.dump.java.TestShutDownHook.main(TestShutDownHook.java:13)[/mw_shl_code]
可以看到Thread-0在等待0x00000007eabd3840锁,而main主线程持有0x00000007eabd3840锁,主线程的System.exit调用触发了shutdownHook并使用join等待Thread-0退出,这里陷入死锁。
这个jstack的信息与线上程序的jstack信息完全一样。可以推测hadoop task无法退出的原因是:该task的某个线程OOM异常触发UncaughtExceptionHandler,里面会调用System.exit。而System.exit会触发org.apache.hadoop.util.ShutdownHookManager。在这个ShutdownHookManager里面会调用所有注册过的hookHandler,应该是某个hookHandler调用了System.exit从而导致block。
参考:http://stackoverflow.com/questions/3715967/when-should-we-call-system-exit-in-java
|