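Thanks a lot.
Continuing to dig into the problem: I recorded the logs from several startups. Leaving out the startup log lines that were identical every time, here are the parts that differ: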
Machine 1:
1st run (failed): Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave2,16020,1557999537309]
2nd run (failed): Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494]
3rd run (OK): Current list of replicators: [gladslave1,16020,1558058594158] other RSs: [gladslave1,16020,1558058594158]
4th run (OK): Current list of replicators: [gladslave2,16020,1558063128559, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559]
Machine 2:
1st run (failed): Current list of replicators: [gladslave2,16020,1557999537309] other RSs: [gladslave2,16020,1557999537309]
2nd run (failed): Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931]
3rd run (OK): Current list of replicators: [gladslave1,16020,1558058594158, gladslave2,16020,1558058594912] other RSs: [gladslave1,16020,1558058594158, gladslave2,16020,1558058594912]
4th run (OK): Current list of replicators: [gladslave2,16020,1558063128559] other RSs: [gladslave2,16020,1558063128559]
Machine 3 (the problem machine):
1st run (failed): Current list of replicators: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave1,16020,1557999537835, gladslave3,16020,1557999539916, gladslave2,16020,1557999537309]
2nd run (failed): Current list of replicators: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494, gladslave3,16020,1558057204781] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931, gladslave1,16020,1558057203494, gladslave3,16020,1558057204781]
3rd run (OK): Current list of replicators: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912, gladslave3,16020,1558058596843] other RSs: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912, gladslave3,16020,1558058596843]
4th run (OK): Current list of replicators: [gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave3,16020,1558063131092, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave3,16020,1558063131092]
Machine 4:
1st run (failed): Current list of replicators: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309] other RSs: [gladslave4,16020,1557999537694, gladslave2,16020,1557999537309]
2nd run (failed): Current list of replicators: [gladslave4,16020,1558057202931] other RSs: [gladslave2,16020,1558057202848, gladslave4,16020,1558057202931]
3rd run (OK): Current list of replicators: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912] other RSs: [gladslave4,16020,1558058595950, gladslave1,16020,1558058594158, gladslave2,16020,1558058594912]
4th run (OK): Current list of replicators: [gladslave2,16020,1558063128559, gladslave4,16020,1558063130892, gladslave1,16020,1558063128846] other RSs: [gladslave1,16020,1558063128846, gladslave2,16020,1558063128559, gladslave4,16020,1558063130892]
Tracing into the source code, I found the following:
[mw_shl_code=java,true]List<String> otherRegionServers = replicationTracker.getListOfRegionServers();
LOG.info("Current list of replicators: " + currentReplicators + " other RSs: "
    + otherRegionServers);
// Look if there's anything to process after a restart
for (String rs : currentReplicators) {
  if (!otherRegionServers.contains(rs)) {
    transferQueues(rs);
  }
}[/mw_shl_code]
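As I read it: for every entry in Current list of replicators that other RSs does not contain, transferQueues is called for that entry. But looking at the problem machine's logs, both lists hold the same entries in every run, so the program would never execute transferQueues there.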
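To double-check that reading, here is a minimal standalone sketch of the same membership test, fed with the two lists from machine 3's 4th run (where the order differs). The class name `ReplicatorCheck` and the helper `missingFrom` are made up for illustration and are not HBase code; the helper only collects the entries for which the loop above would call transferQueues.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplicatorCheck {

    // Hypothetical helper (not part of HBase): returns the entries of
    // currentReplicators that are absent from otherRegionServers, i.e. the
    // entries for which the quoted loop would call transferQueues(rs).
    static List<String> missingFrom(List<String> currentReplicators,
                                    List<String> otherRegionServers) {
        List<String> missing = new ArrayList<>();
        for (String rs : currentReplicators) {
            if (!otherRegionServers.contains(rs)) {
                missing.add(rs);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Lists copied from machine 3's 4th run; each entry is host,port,startcode.
        List<String> currentReplicators = Arrays.asList(
            "gladslave2,16020,1558063128559",
            "gladslave4,16020,1558063130892",
            "gladslave3,16020,1558063131092",
            "gladslave1,16020,1558063128846");
        List<String> otherRegionServers = Arrays.asList(
            "gladslave1,16020,1558063128846",
            "gladslave2,16020,1558063128559",
            "gladslave4,16020,1558063130892",
            "gladslave3,16020,1558063131092");

        // Same entries in a different order: contains() ignores order,
        // so nothing is reported as missing.
        System.out.println(missingFrom(currentReplicators, otherRegionServers));

        // Made-up scenario (not from my logs): a leftover entry with an old
        // start code would be the only one picked up.
        List<String> withStale = new ArrayList<>(currentReplicators);
        withStale.add("gladslave3,16020,1557999539916");
        System.out.println(missingFrom(withStale, otherRegionServers));
    }
}
```

With the logged lists the result is empty, which matches what I see: the order of the entries does not matter, only whether the full host,port,startcode string is present in both lists.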
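I still don't understand what Current list of replicators and other RSs are used for, but I did notice that the problem machine "gladslave3" never appears in the logs of machines 1, 2, and 4. Does that mean the "normal machines" have never used the "problem machine" for backup? And can replicators be understood as the same thing as HDFS replication?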