
Hadoop 2.6 NameNode failover problem

hrsjw1 posted on 2018-2-27 13:03:32
Last edited by hrsjw1 on 2018-2-27 14:06

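Hi everyone, our cluster has recently hit a tricky problem: the two NameNodes of our Hadoop cluster keep failing over to each other. The trigger is that the zkfc service gets "Connection timed out" while monitoring the NameNode on its own host. Raising the timeout eased the failovers, but that only treats the symptom, not the cause. The call queue looks normal, and at failover time the server's CPU, memory, network and I/O are all low. zkfc talks to the NameNode over the lo interface, and packet captures taken during a failover show that packets are missing. We upgraded the NIC driver and related components on the servers (HP servers) to the latest version, but the problem is still there. Below are some zkfc logs and packet captures I have collected. Please take a look, thanks.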

Cluster environment:
os:CentOS Linux release 7.2.1511
hadoop version  :2.6.0-cdh5.5.2
java version : 1.7.0_80

nn1 ip:172.18.0.1
nn2 ip :172.18.0.2

ntp server:172.31.0.110
Deployment: via Cloudera Manager
cloudera manager version:Cloudera Express 5.5.2

*********************

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536        
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 21548526  bytes 7022052126 (6.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 21548526  bytes 7022052126 (6.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
*********************
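The interface statistics above can also be checked mechanically. The following is a minimal sketch, assuming the net-tools `ifconfig` layout shown in the post; the function names `error_counters` and `packet_counts` are made up for illustration.

```python
import re

# Output copied from the post above (loopback interface on the NameNode host).
IFCONFIG_LO = """\
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 21548526  bytes 7022052126 (6.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 21548526  bytes 7022052126 (6.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
"""


def error_counters(ifconfig_text):
    """Return e.g. {'RX errors': 0, 'RX dropped': 0, ...} from the two 'errors' lines."""
    counters = {}
    for line in ifconfig_text.splitlines():
        match = re.match(r"\s*(RX|TX) errors\b", line)
        if not match:
            continue
        direction = match.group(1)
        # Every lowercase "name number" pair on the errors line is one counter.
        for name, value in re.findall(r"([a-z]+)\s+(\d+)", line):
            counters[f"{direction} {name}"] = int(value)
    return counters


def packet_counts(ifconfig_text):
    """Return {'RX': n, 'TX': n} packet totals."""
    return {d: int(n) for d, n in re.findall(r"(RX|TX) packets (\d+)", ifconfig_text)}


nonzero = {k: v for k, v in error_counters(IFCONFIG_LO).items() if v}
print("non-zero error counters:", nonzero)
print("packet totals:", packet_counts(IFCONFIG_LO))
```

For this output every error counter is zero and RX equals TX, so the interface-level statistics show no drops. That only rules out drops that these counters record; it says nothing about stalls elsewhere on the host.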


2018-02-27 11:32:59,781 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: ACTIVE readyToBecomeActive: true}
2018-02-27 11:33:53,050 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: STANDBY readyToBecomeActive: true}
2018-02-27 11:33:00,782 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs sending #378814

2018-02-27 11:33:19,932 DEBUG org.apache.hadoop.ipc.Client: closing ipc connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: Connection timed out
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:526)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1088)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:983)
2018-02-27 11:33:19,934 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs: closed
2018-02-27 11:33:19,934 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs: stopped, remaining connections 0
2018-02-27 11:33:19,935 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Exception <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {java.io.IOException: Failed on local exception: java.io.IOException: Connection timed out; Host Details : local host is: "bj-dc-namenode-001.tendcloud.com/172.18.0.1"; destination host is: "bj-dc-namenode-001.tendcloud.com":8022; }
2018-02-27 11:33:19,936 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: Failed on local exception: java.io.IOException: Connection timed out; Host Details : local host is: "bj-dc-namenode-001.tendcloud.com/172.18.0.1"; destination host is: "bj-dc-namenode-001.tendcloud.com":8022;
2018-02-27 11:33:19,936 DEBUG org.apache.hadoop.ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@1d838b46
2018-02-27 11:33:19,936 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2018-02-27 11:33:19,936 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 entered state: SERVICE_NOT_RESPONDING
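To see how long each health-check request waited before failing, the gap between the last `sending #` line and the following `Connection timed out` line can be computed from the timestamps. This is a rough sketch that assumes the log layout shown above; `stall_seconds` is a hypothetical helper, not part of Hadoop.

```python
import re
from datetime import datetime

# Two lines copied from the excerpt above (the exception text after the
# message on the second line is trimmed).
LOG = """\
2018-02-27 11:33:00,782 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs sending #378814
2018-02-27 11:33:19,932 DEBUG org.apache.hadoop.ipc.Client: closing ipc connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: Connection timed out
"""

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def parse_ts(line):
    """Return the leading log4j timestamp of a line, or None for continuation lines."""
    match = TIMESTAMP.match(line)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def stall_seconds(log_text):
    """Seconds between each 'sending #' line and the timeout line that follows it."""
    last_send = None
    gaps = []
    for line in log_text.splitlines():
        ts = parse_ts(line)
        if ts is None:
            continue  # stack-trace lines carry no timestamp
        if " sending #" in line:
            last_send = ts
        elif ("closing ipc connection" in line
              and "Connection timed out" in line
              and last_send is not None):
            gaps.append((ts - last_send).total_seconds())
            last_send = None
    return gaps


print(stall_seconds(LOG))  # prints [19.15]
```

In this excerpt the request waited about 19 seconds. The stack trace shows the error surfacing from the native read (`sun.nio.ch.FileDispatcherImpl.read0`), which may mean the timeout was reported by the operating system rather than by an RPC timer in Hadoop, though the logs alone do not prove that.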

[Two attached images, each captioned 2018-02-27 11:33]

*********************

2018-02-27 06:55:30,764 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: ACTIVE readyToBecomeActive: true}

2018-02-27 06:58:56,776 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: STANDBY readyToBecomeActive: true}

2018-02-27 06:55:31,765 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs sending #346510

2018-02-27 06:55:50,908 DEBUG org.apache.hadoop.ipc.Client: closing ipc connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: Connection timed out
java.io.IOException: Connection timed out
*********************

2018-02-27 06:31:01,702 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: ACTIVE readyToBecomeActive: true}
2018-02-27 06:31:45,570 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: STANDBY readyToBecomeActive: true}

2018-02-27 06:31:02,703 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs sending #343976

[Attached image: 201802270631.png]
*********************

2018-02-27 06:20:27,757 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: ACTIVE readyToBecomeActive: true}
2018-02-27 06:23:56,561 TRACE org.apache.hadoop.ipc.ProtobufRpcEngine: 27: Response <- bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: getServiceStatus {state: STANDBY readyToBecomeActive: true}

2018-02-27 06:20:28,758 DEBUG org.apache.hadoop.ipc.Client: IPC Client (1775252113) connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs sending #343368

2018-02-27 06:20:47,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 26675ms for sessionid 0x561c6c5bd3011b3, closing socket connection and attempting reconnect
2018-02-27 06:20:47,901 DEBUG org.apache.hadoop.ipc.Client: closing ipc connection to bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022: Connection timed out
java.io.IOException: Connection timed out
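In this excerpt the ZooKeeper client and the NameNode health check fail within about 70 ms of each other. A small sketch, assuming only the log lines quoted above, works out when zkfc last heard from ZooKeeper and compares that with the last health-check request. The helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Lines copied from the 06:20 excerpt above.
ZK_LINE = ("2018-02-27 06:20:47,829 INFO org.apache.zookeeper.ClientCnxn: "
           "Client session timed out, have not heard from server in 26675ms "
           "for sessionid 0x561c6c5bd3011b3, closing socket connection and "
           "attempting reconnect")
SEND_LINE = ("2018-02-27 06:20:28,758 DEBUG org.apache.hadoop.ipc.Client: "
             "IPC Client (1775252113) connection to "
             "bj-dc-namenode-001.tendcloud.com/172.18.0.1:8022 from hdfs "
             "sending #343368")

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def timestamp(line):
    """Parse the 23-character log4j timestamp at the start of a line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


def last_heard_from_zk(zk_line):
    """Subtract the reported silence from the time the timeout was logged."""
    silence_ms = int(
        re.search(r"have not heard from server in (\d+)ms", zk_line).group(1))
    return timestamp(zk_line) - timedelta(milliseconds=silence_ms)


last_heard = last_heard_from_zk(ZK_LINE)
last_send = timestamp(SEND_LINE)
print("last ZooKeeper contact:", last_heard)
print("last health-check send:", last_send)
print("send happened", (last_send - last_heard).total_seconds(),
      "s after the last ZooKeeper contact")
```

The last ZooKeeper contact works out to 06:20:21.154, a few seconds before the last health-check request at 06:20:28.758. So two independent connections, one to a remote ZooKeeper server and one to the local NameNode, were silent over overlapping windows. That could be worth checking against process-wide or host-wide pauses, but it is only an observation from these few lines.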


[Two attached images, each captioned 2018-02-27 06:20]

*********************

18 comments

easthome001 posted on 2018-2-27 13:37:04
Is 172.18.0.1 a host IP address or a gateway?


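hello2019 posted on 2018-2-27 13:40:46
Every failure in these logs is the host connecting to itself. Could it be a loopback interface problem?
How is your hosts file configured? Please post it.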

nextuser posted on 2018-2-27 13:43:20
Is NTP configured? If the clocks are out of sync, this kind of problem can easily occur.

hrsjw1 posted on 2018-2-27 13:53:02
easthome001 posted on 2018-2-27 13:37
Is 172.18.0.1 a host IP address or a gateway?

It is the host's IP address; the service port is 8022.

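hrsjw1 posted on 2018-2-27 13:55:52
hello2019 posted on 2018-2-27 13:40
Every failure in these logs is the host connecting to itself. Could it be a loopback interface problem?
How is your hosts file configured? Please post it.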

172.18.0.1:root@bj-dc-namenode-001:/root]# more /etc/hosts
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.18.0.1        bj-dc-namenode-001.tendcloud.com
172.18.0.2        bj-dc-namenode-002.tendcloud.com
172.18.0.3        bj-dc-yarn-001.tendcloud.com
172.18.0.4        bj-dc-yarn-002.tendcloud.com
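The remaining entries are all the other cluster nodes.

I commented out the top two lines after the problem appeared, because a Google result suggested trying that.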
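A quick way to compare the hosts files on the two NameNodes is to parse them and look at two things: whether `localhost` is defined at all, and whether the NameNode's own FQDN maps to a loopback address. This is a minimal sketch over the file content posted above; `parse_hosts` and `check` are hypothetical helpers, and the parser covers only the plain `address name [alias ...]` layout.

```python
# Content of /etc/hosts on nn1 as posted above (first four cluster entries).
HOSTS_NN1 = """\
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.18.0.1        bj-dc-namenode-001.tendcloud.com
172.18.0.2        bj-dc-namenode-002.tendcloud.com
172.18.0.3        bj-dc-yarn-001.tendcloud.com
172.18.0.4        bj-dc-yarn-002.tendcloud.com
"""


def parse_hosts(text):
    """Map each hostname or alias to its addresses, in file order."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, []).append(address)
    return table


def check(text, fqdn):
    """Summarise the two properties discussed in the thread."""
    table = parse_hosts(text)
    addresses = table.get(fqdn, [])
    return {
        "localhost_defined": "localhost" in table,
        "fqdn_addresses": addresses,
        "fqdn_on_loopback": any(
            a.startswith("127.") or a == "::1" for a in addresses),
    }


print(check(HOSTS_NN1, "bj-dc-namenode-001.tendcloud.com"))
```

With both loopback lines commented out, `localhost` has no entry in this file, while the NameNode FQDN resolves to 172.18.0.1 only.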


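hrsjw1 posted on 2018-2-27 13:58:26
nextuser posted on 2018-2-27 13:43
Is NTP configured? If the clocks are out of sync, this kind of problem can easily occur.

Yes. This cluster is managed with Cloudera Manager, and NTP is configured.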
172.18.0.1:root@bj-dc-namenode-001:/root]# crontab  -l
#sync system time
13 5,9,14,19 * * * /usr/sbin/ntpdate 172.31.0.110 >/dev/null 2>&1

172.18.0.1:root@bj-dc-namenode-001:/root]# grep '^server' /etc/ntp.conf
server 172.31.0.110

This is our time server.
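Since `ntpdate` run from cron can step the clock, one cheap check is whether the timeouts line up with the cron schedule. The sketch below expands the cron line above for 2018-02-27 and measures the distance from each "Connection timed out" timestamp quoted in the first post. It assumes the simple `minute hour-list` form used here and only looks at runs on the same day; `daily_runs` and `nearest_run_gap` are hypothetical helpers.

```python
from datetime import datetime

CRON_LINE = "13 5,9,14,19 * * * /usr/sbin/ntpdate 172.31.0.110 >/dev/null 2>&1"

# Timestamps of the "Connection timed out" lines quoted in the first post.
TIMEOUTS = [
    "2018-02-27 06:20:47",
    "2018-02-27 06:55:50",
    "2018-02-27 11:33:19",
]


def daily_runs(cron_line, day):
    """Expand a cron line whose minute is one number and whose hour is a comma list."""
    minute, hours = cron_line.split()[:2]
    return [
        day.replace(hour=int(h), minute=int(minute), second=0, microsecond=0)
        for h in hours.split(",")
    ]


def nearest_run_gap(event, runs):
    """Distance from an event to the closest run (same-day runs only)."""
    return min(abs(event - run) for run in runs)


runs = daily_runs(CRON_LINE, datetime(2018, 2, 27))
for stamp in TIMEOUTS:
    event = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    print(stamp, "-> nearest ntpdate run is", nearest_run_gap(event, runs), "away")
```

For the three timestamps in the post, the nearest scheduled run is more than an hour away in every case, so these particular timeouts do not coincide with the cron job.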


hrsjw1 posted on 2018-2-27 14:01:57
hrsjw1 posted on 2018-2-27 13:55
172.18.0.1:root@bj-dc-namenode-001:/root]# more /etc/hosts#127.0.0.1   localhost localhost.localdo ...

172.18.0.2:root@bj-dc-namenode-002:/root]# more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.18.0.1        bj-dc-namenode-001.tendcloud.com
172.18.0.2        bj-dc-namenode-002.tendcloud.com

This is the hosts file on nn2; its top two lines are not commented out.
The two NameNodes still keep failing over to each other.


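hrsjw1 posted on 2018-2-27 14:08:06
hrsjw1 posted on 2018-2-27 13:58
Yes. This cluster is managed with Cloudera Manager, and NTP is configured.  172.18.0.1:root@bj-dc-namenode-00 ...

I have checked: the clocks are all correct. They are consistent with each other, and none is running fast or slow.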

hello2019 posted on 2018-2-27 14:51:22
hrsjw1 posted on 2018-2-27 14:08
I have checked: the clocks are all correct. They are consistent with each other, and none is running fast or slow.

172.18.0.1:root@bj-dc-namenode-001:/root]# more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.18.0.1        bj-dc-namenode-001.tendcloud.com
172.18.0.2        bj-dc-namenode-002.tendcloud.com
172.18.0.3        bj-dc-yarn-001.tendcloud.com
172.18.0.4        bj-dc-yarn-002.tendcloud.com

Uncomment the part marked in red above, which appears to be the 127.0.0.1 localhost line.

