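Hi all, I have a question. Over the past couple of days I deployed a Hadoop 2.7.1 HA cluster, and today I ran into a problem while testing failover.
The test uses three servers, with services assigned as follows: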
node  | zookeeper      | hdfs                                                      | yarn
node1 | QuorumPeerMain | NameNode, DataNode, JournalNode, DFSZKFailoverController | ResourceManager, NodeManager
node2 | QuorumPeerMain | NameNode, DataNode, JournalNode, DFSZKFailoverController | ResourceManager, NodeManager
node3 | QuorumPeerMain | DataNode, JournalNode                                     | NodeManager
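The first test went smoothly: with nn1 on node1 active and nn2 on node2 in standby, I ran kill -9 on the nn1 pid, and nn2 on node2 took over and became active as expected.
However, after restarting the cluster I tested again, this time simulating a power failure by stopping node1's network (systemctl stop network). nn2 on node2 could not take over promptly, and its log showed: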
2018-01-26 16:54:56,616 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at http://node1:50070
Serving checkpoints at http://node2:50070
2018-01-26 16:56:16,672 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 0 time(s); maxRetries=45
2018-01-26 16:56:24,137 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2018-01-26 16:56:24,138 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:347)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2018-01-26 16:56:36,693 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 1 time(s); maxRetries=45
2018-01-26 16:56:56,715 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 2 time(s); maxRetries=45
2018-01-26 16:57:16,733 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 3 time(s); maxRetries=45
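nn2 kept trying to connect to the JournalNode on node1. Only after 45 failed connection attempts did it continue the failover and become active, but by then 15 minutes had passed.
To check whether the JournalNode was the problem, I restored the cluster again and killed both the JournalNode and NameNode processes on the active NameNode's node; the standby NameNode failed over successfully. But when I simulated a power failure again, it still took 15 minutes. What could be causing this?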
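As a sanity check on the 15-minute figure, here is a minimal Python sketch that reads the four "Retrying connect" lines from the nn2 log above, measures the spacing between attempts, and multiplies it by the maxRetries value. The function name `estimate_total_delay` is made up for illustration. The spacing and retry count look like they could correspond to Hadoop's IPC client settings `ipc.client.connect.timeout` and `ipc.client.connect.max.retries.on.timeouts`, but that mapping is an assumption and is not confirmed by anything in this post.

```python
from datetime import datetime

# The four retry lines copied verbatim from the nn2 log in this post.
RETRY_LINES = [
    "2018-01-26 16:56:16,672 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 0 time(s); maxRetries=45",
    "2018-01-26 16:56:36,693 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 1 time(s); maxRetries=45",
    "2018-01-26 16:56:56,715 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 2 time(s); maxRetries=45",
    "2018-01-26 16:57:16,733 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node1/192.168.10.187:8485. Already tried 3 time(s); maxRetries=45",
]


def estimate_total_delay(lines):
    """Return (max_retries, seconds_between_attempts, estimated_total_seconds).

    Hypothetical helper: it assumes every attempt is spaced like the ones
    observed in the log and that all max_retries attempts are used up.
    """
    # The first 23 characters of each line are the log4j timestamp.
    stamps = [datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f") for line in lines]
    # The retry limit is printed at the end of each line as "maxRetries=N".
    max_retries = int(lines[0].rsplit("maxRetries=", 1)[1])
    # Average spacing between consecutive attempts.
    interval = (stamps[-1] - stamps[0]).total_seconds() / (len(stamps) - 1)
    return max_retries, interval, max_retries * interval


if __name__ == "__main__":
    retries, interval, total = estimate_total_delay(RETRY_LINES)
    print(f"{retries} attempts x {interval:.1f} s = {total / 60:.1f} min")
```

The attempts are roughly 20 seconds apart, so 45 of them add up to about 15 minutes, which matches the delay observed before nn2 became active. That suggests the wait is spent on connection timeouts to the unreachable node1 rather than on the failover logic itself.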