分享

救命啊,启动后特定机器无法加载region导致迁移给其他机器

本帖最后由 evababy 于 2019-5-16 16:14 编辑

hbase1.4.9 从0.98.12升级过来的的,测试环境下启动后,不一定都失败,大概几率70%左右。
固定的3号机无法加载regtion,且启动过程极慢,启动成功后数据被其他机器分摊,日志看不出一经典错误

已排除磁盘、网络、防火墙、ZK。。。。因素


正常节点日志
2019-05-16 15:22:44,540 INFO  [SplitLogWorker-gladslave4:16020] regionserver.SplitLogWorker: SplitLogWorker gladslave4,16020,1557991360664 starting
2019-05-16 15:22:44,541 INFO  [regionserver/gladslave4/10.86.10.104:16020] regionserver.HeapMemoryManager: Starting HeapMemoryTuner chore.
2019-05-16 15:22:44,543 INFO  [regionserver/gladslave4/10.86.10.104:16020] regionserver.HRegionServer: Serving as gladslave4,16020,1557991360664, RpcServer on gladslave4/10.86.10.104:16020, sessionid=0x101e4be76b0001d
2019-05-16 15:22:44,549 INFO  [regionserver/gladslave4/10.86.10.104:16020] quotas.RegionServerQuotaManager: Quota support disabled
2019-05-16 15:22:47,687 INFO  [regionserver/gladslave4/10.86.10.104:16020] wal.FSHLog: WAL configuration: blocksize=128 MB, rollsize=121.60 MB, prefix=gladslave4%2C16020%2C1557991360664, suffix=, logDir=hdfs://haservice/hbase/WALs/gladslave4,16020,1557991360664, archiveDir=hdfs://haservice/hbase/oldWALs
2019-05-16 15:22:48,050 INFO  [regionserver/gladslave4/10.86.10.104:16020] wal.FSHLog: Slow sync cost: 262 ms, current pipeline: []
2019-05-16 15:22:48,051 INFO  [regionserver/gladslave4/10.86.10.104:16020] wal.FSHLog: New WAL /hbase/WALs/gladslave4,16020,1557991360664/gladslave4%2C16020%2C1557991360664.1557991367687
2019-05-16 15:22:50,147 INFO  [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] regionserver.RSRpcServices: Open t_cr_org_news_mysql,23374,1539592131439.dd06c1db2c107c988032bd1f47b7b9df.
2019-05-16 15:22:50,175 INFO  [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] regionserver.RSRpcServices: Open portrayal_20180604,148896676,1548818066743.e129e1b1ec328a74ab45813c84f8d5bb.
2019-05-16 15:22:50,362 INFO  [StoreOpener-dd06c1db2c107c988032bd1f47b7b9df-1] hfile.CacheConfig: Created cacheConfig for fam: blockCache=LruBlockCache{blockCount=0, currentSize=944720, freeSize=1287545520, maxSize=1288490240, heapSize=944720, minSize=1224065664, minFactor=0.95, multiSize=612032832, multiFactor=0.5, singleSize=306016416, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
后续打印出很多类似日志  还有Open、Created cacheConfig、Onlined等info日志

经常出现错误的3号机器日志
2019-05-16 15:22:49,042 INFO  [SplitLogWorker-gladslave3:16020] regionserver.SplitLogWorker: SplitLogWorker gladslave3,16020,1557991363115 starting
2019-05-16 15:22:49,042 INFO  [regionserver/gladslave3/10.86.10.103:16020] regionserver.HeapMemoryManager: Starting HeapMemoryTuner chore.
2019-05-16 15:22:49,045 INFO  [regionserver/gladslave3/10.86.10.103:16020] regionserver.HRegionServer: Serving as gladslave3,16020,1557991363115, RpcServer on gladslave3/10.86.10.103:16020, sessionid=0x20004d1f8e1001b
2019-05-16 15:22:49,052 INFO  [regionserver/gladslave3/10.86.10.103:16020] quotas.RegionServerQuotaManager: Quota support disabled
2019-05-16 15:22:52,627 INFO  [regionserver/gladslave3/10.86.10.103:16020] wal.FSHLog: WAL configuration: blocksize=128 MB, rollsize=121.60 MB, prefix=gladslave3%2C16020%2C1557991363115, suffix=, logDir=hdfs://haservice/hbase/WALs/gladslave3,16020,1557991363115, archiveDir=hdfs://haservice/hbase/oldWALs
2019-05-16 15:22:53,281 INFO  [regionserver/gladslave3/10.86.10.103:16020] wal.FSHLog: Slow sync cost: 400 ms, current pipeline: []
2019-05-16 15:22:53,281 INFO  [regionserver/gladslave3/10.86.10.103:16020] wal.FSHLog: New WAL /hbase/WALs/gladslave3,16020,1557991363115/gladslave3%2C16020%2C1557991363115.1557991372627
2019-05-16 15:27:48,834 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=922.58 KB, freeSize=1.20 GB, max=1.20 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=29, evicted=0, evictedPerRun=0.0
2019-05-16 15:32:48,833 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=922.58 KB, freeSize=1.20 GB, max=1.20 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0
2019-05-16 15:33:20,350 INFO  [664839586@qtp-2008547236-0] logs: Aliases are enabled
2019-05-16 15:37:48,833 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=922.58 KB, freeSize=1.20 GB, max=1.20 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=89, evicted=0, evictedPerRun=0.0

New WAL日志输出后就无反应了,目测5分钟打印一次红色日志,WAL文件已经生成了并且可以访问到
正常机器也打印的hfile.LruBlockCache日志为:

2019-05-16 15:20:48,008 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=3.58 MB, freeSize=1.20 GB, max=1.20 GB, blockCount=319, accesses=2858, hits=2091, hitRatio=73.16%, , cachingAccesses=2410, cachingHits=1735, cachingHitsRatio=71.99%, evictions=119, evicted=356, evictedPerRun=2.991596638655462

请问该如何查问题呢?没有错误啊


已有(11)人评论

跳转到指定楼层
s060403072 发表于 2019-5-16 19:55:13
本帖最后由 s060403072 于 2019-5-16 20:04 编辑

currentSize=944720, freeSize=1287545520, maxSize=1288490240, heapSize=944720, minSize=1224065664
可用空间可能太少了。
看下这台机器的配置跟其他机器的配置。
余额Quota 是如何配置的。
回复

使用道具 举报

evababy 发表于 2019-5-17 08:56:37
s060403072 发表于 2019-5-16 19:55
currentSize=944720, freeSize=1287545520, maxSize=1288490240, heapSize=944720, minSize=1224065664
可 ...

HDFS没有配置Quota,不过这台机器磁盘是500G  其他机器是1T的,数据总部平衡,特意手动平衡过数据。

新建位图图像.gif


HBASE 的namespace quota 也没有启用,所有琪琪日志都输出了“Quota support disabled

昨天继续查问题,也没结果,一点错误看不到。
1号机器:Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309]

2号机器:Current list of replicators: [gladslave2,16020,1557999537309] other RSs: [gladslave2,16020,1557999537309]

3号机器(问题机):Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309]

4号机器:Current list of replicators: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309]


回复

使用道具 举报

bioger_hit 发表于 2019-5-17 09:00:33
evababy 发表于 2019-5-17 08:56
HDFS没有配置Quota,不过这台机器磁盘是500G  其他机器是1T的,数据总部平衡,特意手动平衡过数据。


...

这没什么问题的

1.png
你可以看他们的剩余空间,其实都是一样的。
500G机器,可能有其他大量的数据。

回复

使用道具 举报

evababy 发表于 2019-5-17 09:08:13

hbase没有错误,感觉有点像启动时感知这台机器总资源少,所以就特意歇菜了,直接吧数据分给个其他机器加载。
只能继续查,不敢往正式上。


回复

使用道具 举报

evababy 发表于 2019-5-17 09:14:16
HDFS 数据检查  一切OK
HBASE 数据检查 也OK
磁盘虽然大小不同,但是占用比正常,算是OK
CPU 一位数 也OK
内存  有足够的剩余内存,也OK
网络 全部1000M 实则传输速率108M,也OK

我就不相信了,查不出问题。
回复

使用道具 举报

evababy 发表于 2019-5-17 09:34:24
disable了一些表,数据量是原来  1/4左右  频繁启动,启动时间在20秒左右,错误几率下降,启动了4次才出现一次错误,也基本排除了超时等问题导致的数据转移。 新建位图图像 (2).gif

0.98版 启动master很快,控制台会逐渐显示个regionserver加载数据以及数量的变化。但是1.4.9版本 启动后无法反正控制台,只有全部regionserver加载完region后才可以访问到控制台,这个功能是从1.X版本开始的吗?

回复

使用道具 举报

bioger_hit 发表于 2019-5-17 11:05:04
evababy 发表于 2019-5-17 09:34
disable了一些表,数据量是原来  1/4左右  频繁启动,启动时间在20秒左右,错误几率下降,启动了4次才出现 ...

这个功能可以,起到保护功能。如果还没加载完数据,就各种操作,出问题的可能性非常大。这就相当于未加载完,集群处于受保护状态。
回复

使用道具 举报

evababy 发表于 2019-5-17 11:42:57
多谢。

继续查问题:记录了几次日志情况,省略其他启动完全一致的日志后差异部分:
1号机器:
第一次错误:Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309]
第二次错误:Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494]
第三次正确:Current list of replicators: [gladslave1,16020,1558058594158] other RSs: [gladslave1,16020,1558058594158]
第四次正确:Current list of replicators: [gladslave2,16020,1558063128559, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559]

2号机器:
第一次错误:Current list of replicators: [gladslave2,16020,1557999537309] other RSs: [gladslave2,16020,1557999537309]
第二次错误:Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931]
第三次正确:Current list of replicators: [gladslave1,16020,1558058594158, gladslave2,16020,1558058594912] other RSs: [gladslave1,16020,1558058594158, gladslave2,16020,1558058594912]
第四次正确:Current list of replicators: [gladslave2,16020,1558063128559] other RSs: [gladslave2,16020,1558063128559]

3号机器(问题机):
第一次错误:Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309]
第二次错误:Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494, gladslave3,16020,1558057204781] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494, gladslave3,16020,1558057204781]
第三次正确:Current list of replicators: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912, gladslave3,16020,1558058596843] other RSs: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912, gladslave3,16020,1558058596843]
第四次正确:Current list of replicators: [gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave3,16020,1558063131092, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave3,16020,1558063131092]

4号机器:
第一次错误:Current list of replicators: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309]
第二次错误:Current list of replicators: [gladslave4,16020,1558057202931] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931]
第三次正确:Current list of replicators: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912] other RSs: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912]
第四次正确:Current list of replicators: [gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559, gladslave4,16020,1558063130892]


跟源码如下:
[mw_shl_code=java,true]List<String> otherRegionServers = replicationTracker.getListOfRegionServers();
    LOG.info("Current list of replicators: " + currentReplicators + " other RSs: "
        + otherRegionServers);

    // Look if there's anything to process after a restart
    for (String rs : currentReplicators) {
      if (!otherRegionServers.contains(rs)) {
        transferQueues(rs);
      }
    }[/mw_shl_code]
如果 other RSs 不包含Current list of replicators,就复制给Current list of replicators,但是查看问题机的日期,程序并不会执行transferQueues

虽然现在还不理解Current list of replicators 和 other RSs干什么用的,但是发现问题机“gladslave3”从未在 1、2、4号机器日志冲出现。是否可以说明“正常机”没使用过“问题机”做备份?replicators是否可以理解成HDFS的replication?

回复

使用道具 举报

evababy 发表于 2019-5-17 14:09:00
他大爷的,吧日志级别改成debug后,测试重启20几次没出现region=0的情况,但是4台机器的数量稍有变化,不是上一次停机的数量!!!难道每次启动都做动平衡吗?
回复

使用道具 举报

12下一页
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

推荐上一条 /2 下一条