2020-09-15 18:09:46,009 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:21:31,592 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:33:44,648 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:45:31,393 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE
...
2020-09-15 18:55:21,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1600162721525_0026 with final state: FAILED
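The recovery entries above show the same application being recovered several times before it is finally marked FAILED. As a rough aid for reading such logs, the following minimal sketch extracts the timestamps of the "Recovering app" lines for one application and computes the gaps between them. The function names (`recovery_times`, `gaps_in_seconds`) are hypothetical helpers introduced only for illustration, and the sketch assumes the log4j timestamp layout seen in the excerpt.

```python
import re
from datetime import datetime

# The four "Recovering app" lines copied from the ResourceManager log excerpt above.
LOG_LINES = [
    "2020-09-15 18:09:46,009 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE",
    "2020-09-15 18:21:31,592 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE",
    "2020-09-15 18:33:44,648 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE",
    "2020-09-15 18:45:31,393 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1600162721525_0026 with 0 attempts and final state = NONE",
]

# Assumed timestamp layout: "YYYY-MM-DD HH:MM:SS,mmm" at the start of each line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*Recovering app: (application_\d+_\d+)"
)


def recovery_times(lines, app_id):
    """Return the timestamps of all 'Recovering app' entries for app_id."""
    times = []
    for line in lines:
        match = PATTERN.match(line)
        if match and match.group(2) == app_id:
            times.append(datetime.strptime(match.group(1), TS_FORMAT))
    return times


def gaps_in_seconds(times):
    """Return the time difference in seconds between consecutive timestamps."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    times = recovery_times(LOG_LINES, "application_1600162721525_0026")
    for gap in gaps_in_seconds(times):
        print(f"{gap / 60:.1f} minutes between recoveries")
```

For this excerpt the gaps come out at roughly 12 minutes each, which suggests that the recovery was being restarted at a fairly regular interval rather than at random.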
2. The log of the Active ResourceManager (cmsnn002) also contains the following ZooKeeper-related errors. They show that a ZooKeeper connection failure caused the Active ResourceManager to transition to the Standby state 【2】:
2020-09-15 16:36:00,882 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2020-09-15 16:36:00,882 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2020-09-15 16:36:00,882 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
2020-09-15 16:36:00,883 WARN org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning the resource manager to standby.
2020-09-15 16:36:00,921 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2020-09-15 16:47:09,713 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/yarnRM/ActiveBreadCrumb to indicate that the local node is the most recent active...
2020-09-15 16:47:09,718 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2020-09-15 16:47:22,879 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 10176 applications
2020-09-15 16:25:43,431 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x17187f84f281336 due to java.io.IOException: Len error 14307055
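6. While researching the error, we found a Knowledge Base article on the Cloudera website that describes a similar ZooKeeper problem 【6】. According to that article, the problem is related to the size configured for ZooKeeper's Jute Max Buffer parameter (jute.maxbuffer).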
2020-09-15 16:25:43,431 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x17187f84f281336 due to java.io.IOException: Len error 14307055
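To make the link between the "Len error" value and the Jute Max Buffer setting concrete, here is a minimal sketch that pulls the length out of the ZooKeeper log line above and compares it with a buffer limit. The limit used, 0xFFFFF (1048575 bytes), is the upstream ZooKeeper default for jute.maxbuffer; the value actually configured on this cluster is not given in the text, so treat it as an assumption. The helper names `len_error_bytes` and `exceeds_limit` are hypothetical.

```python
import re

# Upstream ZooKeeper default for jute.maxbuffer (just under 1 MiB).
# Assumption: the cluster's real setting is not stated in the article
# and may differ (for example, if it was changed in Cloudera Manager).
DEFAULT_JUTE_MAX_BUFFER = 0xFFFFF  # 1048575 bytes

LEN_ERROR = re.compile(r"Len error (\d+)")

# ZooKeeper server log line copied from the excerpt above.
ZK_LINE = (
    "2020-09-15 16:25:43,431 WARN org.apache.zookeeper.server.NIOServerCnxn: "
    "Exception causing close of session 0x17187f84f281336 due to "
    "java.io.IOException: Len error 14307055"
)


def len_error_bytes(line):
    """Return the packet length reported in a 'Len error' line, or None."""
    match = LEN_ERROR.search(line)
    return int(match.group(1)) if match else None


def exceeds_limit(size, limit=DEFAULT_JUTE_MAX_BUFFER):
    """Return True if a request of `size` bytes is larger than the buffer limit."""
    return size > limit


if __name__ == "__main__":
    size = len_error_bytes(ZK_LINE)
    print(f"request size: {size} bytes ({size / 2**20:.2f} MiB)")
    print(f"exceeds limit of {DEFAULT_JUTE_MAX_BUFFER} bytes: {exceeds_limit(size)}")
```

The reported length of 14307055 bytes is about 13.64 MiB, so under the assumed default the request is far above the limit, which is consistent with the Knowledge Base article pointing at the Jute Max Buffer size.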