Our HDFS cluster has recently started raising the alert: "This DataNode is not connected to one or more of its NameNode(s)."
I went through the logs and found some error messages, but I still haven't pinned down the root cause.
The cluster had been running fine, and no changes were made recently; the problem appeared out of nowhere. We found nothing conclusive at the time, restarted the DataNode, and it recovered.
But the problem was not actually fixed: every 2-3 days another DataNode (not the same one each time) hits the same issue.
Monitoring shows no significant change in CPU, memory, or GC around the failure times (which were not during business peak hours).
Below are some WARN and ERROR messages from that window; ipaddr is the IP of the affected DataNode.
NameNode log:
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176622248_105822235
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176622125_105822105
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176622384_105822379
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176621497_105821443
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176622521_105822523
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176621498_105821444
6:31:01.610 PM | WARN | BlockManager | PendingReplicationMonitor timed out blk_1176621116_105821044
6:31:27.390 PM | WARN | NetworkTopology | The cluster does not contain node: /default/ipaddr:50010
6:31:27.390 PM | WARN | NetworkTopology | The cluster does not contain node: /default/ipaddr:50010
6:31:27.390 PM | WARN | NetworkTopology | The cluster does not contain node: /default/ipaddr:50010
6:31:27.390 PM | WARN | NetworkTopology | The cluster does not contain node: /default/ipaddr:50010

Log from the affected DataNode:
6:31:03.506 PM | WARN | UserGroupInformation | PriviledgedActionException as:blk_1175113165_104282103 (auth:SIMPLE) cause:java.io.IOException: replica.getGenerationStamp() < block.getGenerationStamp(), block=BP-202159622-xx.xx.xx.xx-1529480710771:blk_1175113165_104282103, replica=ReplicaWaitingToBeRecovered, blk_1175113165_104281990, RWR getNumBytes() = 35262477 getBytesOnDisk() = 35262477 getVisibleLength()= -1 getVolume() = /onstardiskl/dfs/dn/current getBlockFile() = /onstardiskl/dfs/dn/current/BP-202159622-xx.xx.xx.xx-1529480710771/current/rbw/blk_1175113165
6:31:03.506 PM | INFO | Server | IPC Server handler 37 on 50020, call org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol.getReplicaVisibleLength from xx.xx.xx.xx:50356 Call#4713650 Retry#0
java.io.IOException: replica.getGenerationStamp() < block.getGenerationStamp(), block=BP-202159622-xx.xx.xx.xx-1529480710771:blk_1175113165_104282103, replica=ReplicaWaitingToBeRecovered, blk_1175113165_104281990, RWR getNumBytes() = 35262477 getBytesOnDisk() = 35262477 getVisibleLength()= -1 getVolume() = /onstardiskl/dfs/dn/current getBlockFile() = /onstardiskl/dfs/dn/current/BP-202159622-xx.xx.xx.xx-1529480710771/current/rbw/blk_1175113165
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaVisibleLength(FsDatasetImpl.java:2591)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.getReplicaVisibleLength(DataNode.java:2756)
    at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.getReplicaVisibleLength(ClientDatanodeProtocolServerSideTranslatorPB.java:107)
    at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17873)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
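For context, the IOException above comes from the DataNode's staleness check in FsDatasetImpl.getReplicaVisibleLength: a replica whose generation stamp is older than the block's current generation stamp is stale, and an RWR (ReplicaWaitingToBeRecovered) replica is typically left behind by an interrupted write, e.g. after an unclean DataNode restart. A minimal Python sketch of that comparison, using the numbers from the log (the function name is mine, not part of the HDFS API):

```python
# Illustrative sketch of the generation-stamp check that produced the
# IOException above (helper name is hypothetical, not HDFS code).

def replica_visible_length(replica_gen_stamp, block_gen_stamp, visible_length):
    # Mirrors the idea in FsDatasetImpl.getReplicaVisibleLength: a replica
    # whose generation stamp lags the block's current one is stale and
    # must be rejected rather than served to the client.
    if replica_gen_stamp < block_gen_stamp:
        raise IOError("replica.getGenerationStamp() < block.getGenerationStamp()")
    return visible_length

# Values from the log: blk_1175113165 carries genstamp 104282103 at the
# NameNode, but the on-disk RWR replica still has the older 104281990,
# so the getReplicaVisibleLength RPC fails.
try:
    replica_visible_length(104281990, 104282103, -1)
except IOError as e:
    print("rejected:", e)
```

If this is the pattern, the stale RWR files under .../current/rbw/ are a symptom of interrupted writes rather than the cause of the disconnect itself.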
6:31:34.343 PM | INFO | DataNode | Slow BlockReceiver write packet to mirror took 405ms (threshold=300ms)
6:31:34.841 PM | INFO | DataNode | Slow BlockReceiver write packet to mirror took 411ms (threshold=300ms)
6:31:34.990 PM | INFO | DataNode | Slow BlockReceiver write packet to mirror took 508ms (threshold=300ms)
6:32:01.135 PM | INFO | clienttrace | src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 231a57d3b76a29bd4df8aaa59539067e, slotIdx: 27, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
6:32:01.135 PM | INFO | clienttrace | src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 8d170d6023e785411dbbe4bee2e64a18, slotIdx: 126, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
6:32:01.136 PM | INFO | clienttrace | src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: c3d854d0358db8162ba15be80d2376d5, slotIdx: 79, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
6:32:01.136 PM | INFO | clienttrace | src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 8d170d6023e785411dbbe4bee2e64a18, slotIdx: 77, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
6:40:26.442 PM | INFO | DataNode | Likely the client has stopped reading, disconnecting it (ipaddr:50010:DataXceiver error processing READ_BLOCK operation src: /ipaddr:57414 dst: /ipaddr:50010); java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/ipaddr:50010 remote=/ipaddr:57414]
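Since the failure repeats every few days on different nodes, one practical next step is to grep each DataNode's log for these signatures and correlate the timestamps with per-disk and per-NIC metrics. A small self-contained Python sketch of that triage (the sample lines are copied from the excerpt above; the patterns and any real log path are assumptions, not something from this cluster):

```python
import re
from collections import Counter

# Sample lines taken from the DataNode log excerpt in this post; in
# practice you would read the actual log file (path is an assumption,
# e.g. /var/log/hadoop-hdfs/...).
SAMPLE_LOG = """\
6:31:34.343 PM | INFO | DataNode | Slow BlockReceiver write packet to mirror took 405ms (threshold=300ms)
6:31:34.841 PM | INFO | DataNode | Slow BlockReceiver write packet to mirror took 411ms (threshold=300ms)
6:40:26.442 PM | INFO | DataNode | Likely the client has stopped reading, disconnecting it
"""

# Signatures seen in the excerpt above; extend as needed.
PATTERNS = {
    "slow_mirror_write": re.compile(r"Slow BlockReceiver write packet to mirror took (\d+)ms"),
    "client_stopped_reading": re.compile(r"Likely the client has stopped reading"),
}

def triage(log_text):
    """Count each signature and collect the slow-write latencies in ms."""
    counts = Counter()
    slow_ms = []
    for line in log_text.splitlines():
        for name, pat in PATTERNS.items():
            m = pat.search(line)
            if m:
                counts[name] += 1
                if name == "slow_mirror_write":
                    slow_ms.append(int(m.group(1)))
    return counts, slow_ms

counts, slow_ms = triage(SAMPLE_LOG)
print(counts, max(slow_ms))
```

One more data point worth noting: the "480000 millis timeout" in the READ_BLOCK error equals the 8-minute default of dfs.datanode.socket.write.timeout, which usually means a client stopped consuming data for 8 minutes rather than the DataNode itself failing, so that entry may be a side effect rather than the cause.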
I still don't know what is causing this. Has anyone run into something similar? Any help would be much appreciated, thanks.