evababy posted on 2019-5-17 16:20 Not at all. A detour is still a road. At the very least, the problem is now clearly understood. |
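https://issues.apache.org/jira/browse/HBASE-22439 After I logged in, a prompt to choose the Chinese interface suddenly popped up... so I wrote the description in Chinese, and got looked down on for it, so that question effectively doesn't count as asked.... It seems that every startup performs a not-very-rigorous balancer run, which may lead to uneven region distribution. It also seems the admin commands do not yet provide a global balancer; luckily I had earlier written a Spark job that runs the balancer once on every table. That said, a restart possibly triggering a balancer run is ridiculous!! Also, 0.98.X did not reassign regions on restart, did it? I never paid attention to this before. The test environment has already been upgraded to 1.4.9; I'll restart the production environment in a bit as a test!!! |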
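Damn it. After changing the log level to debug, I restarted 20-odd times in testing and the region=0 case never appeared, but the region counts on the 4 machines changed slightly and did not match the counts at the last shutdown!!! Is dynamic balancing really done on every startup? |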
Many thanks. Continuing the investigation: I recorded the logs from several startups. After omitting the log lines that are identical across startups, the differing parts are:

Machine 1:
First start (failed): Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309]
Second start (failed): Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494]
Third start (correct): Current list of replicators: [gladslave1,16020,1558058594158] other RSs: [gladslave1,16020,1558058594158]
Fourth start (correct): Current list of replicators: [gladslave2,16020,1558063128559, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559]

Machine 2:
First start (failed): Current list of replicators: [gladslave2,16020,1557999537309] other RSs: [gladslave2,16020,1557999537309]
Second start (failed): Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931]
Third start (correct): Current list of replicators: [gladslave1,16020,1558058594158, gladslave2,16020,1558058594912] other RSs: [gladslave1,16020,1558058594158, gladslave2,16020,1558058594912]
Fourth start (correct): Current list of replicators: [gladslave2,16020,1558063128559] other RSs: [gladslave2,16020,1558063128559]

Machine 3 (the problem machine):
First start (failed): Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309]
Second start (failed): Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494, gladslave3,16020,1558057204781] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494, gladslave3,16020,1558057204781]
Third start (correct): Current list of replicators: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912, gladslave3,16020,1558058596843] other RSs: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912, gladslave3,16020,1558058596843]
Fourth start (correct): Current list of replicators: [gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave3,16020,1558063131092, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave3,16020,1558063131092]

Machine 4:
First start (failed): Current list of replicators: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309]
Second start (failed): Current list of replicators: [gladslave4,16020,1558057202931] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931]
Third start (correct): Current list of replicators: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912] other RSs: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912]
Fourth start (correct): Current list of replicators: [gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559, gladslave4,16020,1558063130892]

Tracing into the source code gives the following:
[mw_shl_code=java,true]
List<String> otherRegionServers = replicationTracker.getListOfRegionServers();
LOG.info("Current list of replicators: " + currentReplicators + " other RSs: " + otherRegionServers);
// Look if there's anything to process after a restart
for (String rs : currentReplicators) {
  if (!otherRegionServers.contains(rs)) {
    transferQueues(rs);
  }
}
[/mw_shl_code]
For each entry in Current list of replicators that is not in other RSs, transferQueues is called for that entry. Judging by the problem machine's logs, however, the program never executes transferQueues, because every replicator it lists also appears in other RSs.

I still don't understand what Current list of replicators and other RSs are used for, but I noticed that the problem machine "gladslave3" never appears in the logs of machines 1, 2, and 4. Does that mean the "normal machines" have never used the "problem machine" for backup? Can replicators be understood as the same thing as HDFS replication? |
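To make the check in the quoted loop easier to reason about, here is a minimal standalone sketch that replays it against one logged sample. It is not HBase code: the class `ReplicatorCheck` and the method `missingFromOthers` are hypothetical names introduced for illustration, and the method only collects the entries for which the quoted loop would call transferQueues. The third case is a made-up scenario (an entry left over with an older start code), not something taken from the logs above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplicatorCheck {
    // Hypothetical helper: returns the entries of currentReplicators that are
    // absent from otherRegionServers, i.e. the entries the quoted loop would
    // pass to transferQueues(rs).
    public static List<String> missingFromOthers(List<String> currentReplicators,
                                                 List<String> otherRegionServers) {
        List<String> missing = new ArrayList<>();
        for (String rs : currentReplicators) {
            if (!otherRegionServers.contains(rs)) {
                missing.add(rs);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Machine 3, first (failed) start: both lists hold the same four entries.
        List<String> replicators = Arrays.asList(
                "gladslave4,16020,1557999537694",
                "gladslave1,16020,1557999537835",
                "gladslave3,16020,1557999539916",
                "gladslave2,16020,1557999537309");
        List<String> others = new ArrayList<>(replicators);
        System.out.println(missingFromOthers(replicators, others)); // prints []

        // Machine 4, second (failed) start: other RSs is a superset of replicators.
        List<String> replicators4 = Arrays.asList("gladslave4,16020,1558057202931");
        List<String> others4 = Arrays.asList(
                "gladslave2,16020,1558057202848",
                "gladslave4,16020,1558057202931");
        System.out.println(missingFromOthers(replicators4, others4)); // prints []

        // Made-up scenario (NOT from the logs): a replicator entry with an old
        // start code that no longer matches any entry in other RSs.
        List<String> stale = Arrays.asList("gladslave3,16020,1557999539916");
        List<String> live = Arrays.asList("gladslave3,16020,1558057204781");
        System.out.println(missingFromOthers(stale, live)); // prints [gladslave3,16020,1557999539916]
    }
}
```

In both samples checked here, every replicator also appears in other RSs, so the result is empty. That matches the observation that transferQueues is never reached, and it suggests the failed and correct starts cannot be told apart by this check alone.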
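evababy posted on 2019-5-17 09:34 This feature is fine; it serves as protection. If all sorts of operations run before the data has finished loading, problems are very likely. In effect, the cluster stays in a protected state until loading completes. |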
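HDFS data check: all OK. HBase data check: also OK. Disks: the sizes differ, but the usage ratios are normal, so that counts as OK. CPU: single digits, also OK. Memory: plenty of free memory, also OK. Network: all 1000M, actual transfer rate 108M, also OK. I refuse to believe the problem can't be found. |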
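bioger_hit posted on 2019-5-17 09:00 HBase reports no errors. It feels as though, at startup, it senses that this machine has fewer total resources and deliberately sits out, handing the data straight to the other machines to load. I can only keep investigating; I don't dare try this on production. |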
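evababy posted on 2019-5-17 08:56 There is nothing wrong with this. Look at their remaining space; it is actually the same on all of them. The 500G machine may hold a large amount of other data. |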