救命啊，启动后特定机器无法加载region导致迁移给其他机器-HBase-About云-梭伦科技

evababy 发表于 2019-5-16 16:02:58

救命啊，启动后特定机器无法加载region导致迁移给其他机器

本帖最后由 evababy 于 2019-5-16 16:14 编辑

hbase1.4.9 从0.98.12升级过来的的，测试环境下启动后，不一定都失败，大概几率70%左右。
固定的3号机无法加载regtion，且启动过程极慢，启动成功后数据被其他机器分摊，日志看不出一经典错误

已排除磁盘、网络、防火墙、ZK。。。。因素

正常节点日志
2019-05-16 15:22:44,540 INFO regionserver.SplitLogWorker: SplitLogWorker gladslave4,16020,1557991360664 starting
2019-05-16 15:22:44,541 INFO regionserver.HeapMemoryManager: Starting HeapMemoryTuner chore.
2019-05-16 15:22:44,543 INFO regionserver.HRegionServer: Serving as gladslave4,16020,1557991360664, RpcServer on gladslave4/10.86.10.104:16020, sessionid=0x101e4be76b0001d
2019-05-16 15:22:44,549 INFO quotas.RegionServerQuotaManager: Quota support disabled
2019-05-16 15:22:47,687 INFO wal.FSHLog: WAL configuration: blocksize=128 MB, rollsize=121.60 MB, prefix=gladslave4%2C16020%2C1557991360664, suffix=, logDir=hdfs://haservice/hbase/WALs/gladslave4,16020,1557991360664, archiveDir=hdfs://haservice/hbase/oldWALs
2019-05-16 15:22:48,050 INFO wal.FSHLog: Slow sync cost: 262 ms, current pipeline: []
2019-05-16 15:22:48,051 INFO wal.FSHLog: New WAL /hbase/WALs/gladslave4,16020,1557991360664/gladslave4%2C16020%2C1557991360664.1557991367687
2019-05-16 15:22:50,147 INFO regionserver.RSRpcServices: Open t_cr_org_news_mysql,23374,1539592131439.dd06c1db2c107c988032bd1f47b7b9df.
2019-05-16 15:22:50,175 INFO regionserver.RSRpcServices: Open portrayal_20180604,148896676,1548818066743.e129e1b1ec328a74ab45813c84f8d5bb.
2019-05-16 15:22:50,362 INFO hfile.CacheConfig: Created cacheConfig for fam: blockCache=LruBlockCache{blockCount=0, currentSize=944720, freeSize=1287545520, maxSize=1288490240, heapSize=944720, minSize=1224065664, minFactor=0.95, multiSize=612032832, multiFactor=0.5, singleSize=306016416, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
后续打印出很多类似日志还有Open、Created cacheConfig、Onlined等info日志

经常出现错误的3号机器日志
2019-05-16 15:22:49,042 INFO regionserver.SplitLogWorker: SplitLogWorker gladslave3,16020,1557991363115 starting
2019-05-16 15:22:49,042 INFO regionserver.HeapMemoryManager: Starting HeapMemoryTuner chore.
2019-05-16 15:22:49,045 INFO regionserver.HRegionServer: Serving as gladslave3,16020,1557991363115, RpcServer on gladslave3/10.86.10.103:16020, sessionid=0x20004d1f8e1001b
2019-05-16 15:22:49,052 INFO quotas.RegionServerQuotaManager: Quota support disabled
2019-05-16 15:22:52,627 INFO wal.FSHLog: WAL configuration: blocksize=128 MB, rollsize=121.60 MB, prefix=gladslave3%2C16020%2C1557991363115, suffix=, logDir=hdfs://haservice/hbase/WALs/gladslave3,16020,1557991363115, archiveDir=hdfs://haservice/hbase/oldWALs
2019-05-16 15:22:53,281 INFO wal.FSHLog: Slow sync cost: 400 ms, current pipeline: []
2019-05-16 15:22:53,281 INFO wal.FSHLog: New WAL /hbase/WALs/gladslave3,16020,1557991363115/gladslave3%2C16020%2C1557991363115.1557991372627
2019-05-16 15:27:48,834 INFO hfile.LruBlockCache: totalSize=922.58 KB, freeSize=1.20 GB, max=1.20 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=29, evicted=0, evictedPerRun=0.0
2019-05-16 15:32:48,833 INFO hfile.LruBlockCache: totalSize=922.58 KB, freeSize=1.20 GB, max=1.20 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0
2019-05-16 15:33:20,350 INFO logs: Aliases are enabled
2019-05-16 15:37:48,833 INFO hfile.LruBlockCache: totalSize=922.58 KB, freeSize=1.20 GB, max=1.20 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=89, evicted=0, evictedPerRun=0.0

New WAL日志输出后就无反应了，目测5分钟打印一次红色日志，WAL文件已经生成了并且可以访问到
正常机器也打印的hfile.LruBlockCache日志为：

2019-05-16 15:20:48,008 INFO hfile.LruBlockCache: totalSize=3.58 MB, freeSize=1.20 GB, max=1.20 GB, blockCount=319, accesses=2858, hits=2091, hitRatio=73.16%, , cachingAccesses=2410, cachingHits=1735, cachingHitsRatio=71.99%, evictions=119, evicted=356, evictedPerRun=2.991596638655462

请问该如何查问题呢？没有错误啊

s060403072 发表于 2019-5-16 19:55:13

本帖最后由 s060403072 于 2019-5-16 20:04 编辑

currentSize=944720, freeSize=1287545520, maxSize=1288490240, heapSize=944720, minSize=1224065664
可用空间可能太少了。
看下这台机器的配置跟其他机器的配置。
余额Quota 是如何配置的。

evababy 发表于 2019-5-17 08:56:37

s060403072 发表于 2019-5-16 19:55
currentSize=944720, freeSize=1287545520, maxSize=1288490240, heapSize=944720, minSize=1224065664
可 ...

HDFS没有配置Quota，不过这台机器磁盘是500G其他机器是1T的，数据总部平衡，特意手动平衡过数据。

HBASE 的namespace quota 也没有启用，所有琪琪日志都输出了“Quota support disabled”

昨天继续查问题，也没结果，一点错误看不到。
1号机器：Current list of replicators: other RSs:

2号机器：Current list of replicators: other RSs:

3号机器（问题机）：Current list of replicators: other RSs:

4号机器：Current list of replicators: other RSs:

bioger_hit 发表于 2019-5-17 09:00:33

evababy 发表于 2019-5-17 08:56
HDFS没有配置Quota，不过这台机器磁盘是500G其他机器是1T的，数据总部平衡，特意手动平衡过数据。

...

这没什么问题的

你可以看他们的剩余空间，其实都是一样的。
500G机器，可能有其他大量的数据。

evababy 发表于 2019-5-17 09:08:13

bioger_hit 发表于 2019-5-17 09:00
这没什么问题的

hbase没有错误，感觉有点像启动时感知这台机器总资源少，所以就特意歇菜了，直接吧数据分给个其他机器加载。
只能继续查，不敢往正式上。

evababy 发表于 2019-5-17 09:14:16

HDFS 数据检查一切OK
HBASE 数据检查也OK
磁盘虽然大小不同，但是占用比正常，算是OK
CPU 一位数也OK
内存有足够的剩余内存，也OK
网络全部1000M 实则传输速率108M，也OK

我就不相信了，查不出问题。

evababy 发表于 2019-5-17 09:34:24

disable了一些表，数据量是原来1/4左右频繁启动，启动时间在20秒左右，错误几率下降，启动了4次才出现一次错误，也基本排除了超时等问题导致的数据转移。

0.98版启动master很快，控制台会逐渐显示个regionserver加载数据以及数量的变化。但是1.4.9版本启动后无法反正控制台，只有全部regionserver加载完region后才可以访问到控制台，这个功能是从1.X版本开始的吗？

bioger_hit 发表于 2019-5-17 11:05:04

evababy 发表于 2019-5-17 09:34
disable了一些表，数据量是原来1/4左右频繁启动，启动时间在20秒左右，错误几率下降，启动了4次才出现 ...

这个功能可以，起到保护功能。如果还没加载完数据，就各种操作，出问题的可能性非常大。这就相当于未加载完，集群处于受保护状态。

evababy 发表于 2019-5-17 11:42:57

多谢。

继续查问题：记录了几次日志情况，省略其他启动完全一致的日志后差异部分：
1号机器：
第一次错误：Current list of replicators: other RSs:
第二次错误：Current list of replicators: other RSs:
第三次正确：Current list of replicators: other RSs:
第四次正确：Current list of replicators: other RSs:

2号机器：
第一次错误：Current list of replicators: other RSs:
第二次错误：Current list of replicators: other RSs:
第三次正确：Current list of replicators: other RSs:
第四次正确：Current list of replicators: other RSs:

3号机器（问题机）：
第一次错误：Current list of replicators: other RSs:
第二次错误：Current list of replicators: other RSs:
第三次正确：Current list of replicators: other RSs:
第四次正确：Current list of replicators: other RSs:

4号机器：
第一次错误：Current list of replicators: other RSs:
第二次错误：Current list of replicators: other RSs:
第三次正确：Current list of replicators: other RSs:
第四次正确：Current list of replicators: other RSs:

跟源码如下：
List<String> otherRegionServers = replicationTracker.getListOfRegionServers();
LOG.info("Current list of replicators: " + currentReplicators + " other RSs: "
   + otherRegionServers);

// Look if there's anything to process after a restart
for (String rs : currentReplicators) {
   if (!otherRegionServers.contains(rs)) {
   transferQueues(rs);
   }
}
如果 other RSs 不包含Current list of replicators，就复制给Current list of replicators，但是查看问题机的日期，程序并不会执行transferQueues

虽然现在还不理解Current list of replicators 和 other RSs干什么用的，但是发现问题机“gladslave3”从未在 1、2、4号机器日志冲出现。是否可以说明“正常机”没使用过“问题机”做备份？replicators是否可以理解成HDFS的replication？

evababy 发表于 2019-5-17 14:09:00

他大爷的，吧日志级别改成debug后，测试重启20几次没出现region=0的情况，但是4台机器的数量稍有变化，不是上一次停机的数量！！！难道每次启动都做动平衡吗？

页: [1] 2

About云-梭伦科技's Archiver

救命啊，启动后特定机器无法加载region导致迁移给其他机器