
Removing a DataNode: stuck at "Decommission In Progress"

poptang4 posted on 2013-10-25 10:42:08
While removing a node I ran into a problem:
judging by block counts, every block on the DataNode being removed has already been replicated to other DataNodes,
yet the node stays in "Decommission In Progress".
From the logs, that DataNode is still running DataBlockScanner verification:
2012-09-24 16:47:51,301 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_3921915826611683588_8027. Its ok since it not in datanode dataset anymore.
2012-09-24 16:51:32,742 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_-902553692494195260_8082. Its ok since it not in datanode dataset anymore.
2012-09-24 16:56:06,284 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_9180843115150344245_25940. Its ok since it not in datanode dataset anymore.
2012-09-24 16:56:49,371 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_-5002984851235119528_31752. Its ok since it not in datanode dataset anymore.
2012-09-24 16:57:10,412 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_2897706191202607999_15164. Its ok since it not in datanode dataset anymore.
2012-09-24 16:57:24,440 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_3569549920590190118_27137. Its ok since it not in datanode dataset anymore.
2012-09-24 16:58:45,601 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_9070131173655120747_33437. Its ok since it not in datanode dataset anymore.
2012-09-24 16:59:23,681 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_-8158061507642883210_23376. Its ok since it not in datanode dataset anymore.
2012-09-24 16:59:32,698 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_1569989466833547348_21624. Its ok since it not in datanode dataset anymore.
2012-09-24 16:59:36,707 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_5530553709255233999_32763. Its ok since it not in datanode dataset anymore.
2012-09-24 16:59:38,711 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_7165837478027158747_24918. Its ok since it not in datanode dataset anymore.
The NameNode log keeps printing the following (it never stops, which makes me suspect it is the cause; 10.10.10.150 is the DataNode being removed):
2012-09-24 17:04:39,082 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Block: blk_7215589835729902944_1601, Expected Replicas: 10, live replicas: 3, corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.10.10.158:50010 10.10.10.151:50010 10.10.10.150:50010 10.10.10.23:50010 , Current Datanode: 10.10.10.150:50010, Is current datanode decommissioning: true

sq331335144 posted on 2013-10-25 10:42:08
I think I have found the cause.
Notice that Expected Replicas is 10, even though dfs.replication is configured as 3. Where does the 10 come from? It comes from the MapReduce JobClient.
When the JobClient submits a job, it uploads the job's jar to HDFS, and that upload specifies its own replication factor.
That replication factor is read from configuration and defaults to 10.
You can verify this in HDFS right after submitting a job (as long as the job has not finished yet).
When a job completes successfully, the uploaded jar is deleted (a guess: the jars of successful jobs are nowhere to be found in HDFS).
If the job fails, its jar is left behind, and HDFS keeps trying to satisfy the 10-replica target.
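As a sketch of a preventive fix, the default could be capped at the cluster's actual replication factor. This assumes the classic MRv1 property name used in the code quoted later in this thread (`mapred.submit.replication`; newer MRv2 releases renamed it to `mapreduce.client.submit.file.replication`):

```xml
<!-- mapred-site.xml: cap the replication of submitted job files
     (job.jar, splits) so a small cluster can actually satisfy it.
     Property name taken from the MRv1 JobClient code; hedged. -->
<property>
  <name>mapred.submit.replication</name>
  <value>3</value>
</property>
```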
You can run `hadoop fsck /` to check the health of the whole filesystem.
If the output contains something like
/data/hadoopdata/tmp/mapred/staging/root/.staging/job_201209061129_2679/job.jar:  Under replicated blk_-8552132555561444140_34127. Target Replicas is 10 but found 4 replica(s).
then the replica count is short of the target (I only have 4 DataNodes here).
That explains the "Expected Replicas: 10".
When we decommission a DataNode, the data appears to have moved away (going by block counts), but the state never changes: the NameNode sees that some block still has fewer replicas than its target, considers the data incomplete, and therefore never lets the node finish decommissioning. (I only skimmed the code, so this is largely guesswork; corrections welcome.)
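The gating behavior described above can be sketched as a simplified model (this is my own illustration, not the real FSNamesystem code; the class and method names are made up):

```java
// Simplified model of why decommission stalls: the NameNode will not
// finish decommissioning a node while any block stored on it has fewer
// live replicas elsewhere than that block's own replication target.
// (Hypothetical sketch; not actual Hadoop NameNode code.)
public class DecommissionModel {

    // true if this block no longer pins the decommissioning node
    public static boolean blockSatisfied(int liveReplicasElsewhere,
                                         int expectedReplicas) {
        return liveReplicasElsewhere >= expectedReplicas;
    }

    // each entry: {liveReplicasElsewhere, expectedReplicas}
    public static boolean canFinishDecommission(int[][] blocks) {
        for (int[] b : blocks) {
            if (!blockSatisfied(b[0], b[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Ordinary data blocks: target 3, 3 replicas on other nodes -> fine.
        // The leftover job.jar block from the NameNode log: target 10, but a
        // 4-node cluster can hold at most 3 replicas off the dying node,
        // so the node is stuck in "Decommission In Progress" forever.
        int[][] blocks = { {3, 3}, {3, 3}, {3, 10} };
        System.out.println(canFinishDecommission(blocks)); // false

        // After deleting the failed job's jar (or lowering its target),
        // the remaining blocks all meet their targets and the node frees up.
        int[][] afterCleanup = { {3, 3}, {3, 3} };
        System.out.println(canFinishDecommission(afterCleanup)); // true
    }
}
```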
Proposed fix (untested; I will verify at work tomorrow):
delete the jars left behind by failed MR jobs.
Frankly, those jars are of no use anyway, so alternatively you could simply stop the DataNode outright.
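As a sketch of the cleanup, using the staging path and job ID from the fsck output quoted earlier (both will differ per cluster; untested against a live cluster, Hadoop 1.x command syntax assumed):

```
# Find under-replicated leftovers in the job staging area
hadoop fsck /data/hadoopdata/tmp/mapred/staging -files -blocks | grep "Under replicated"

# Option 1: delete the failed job's leftover staging directory
hadoop fs -rmr /data/hadoopdata/tmp/mapred/staging/root/.staging/job_201209061129_2679

# Option 2: keep the jar but lower its replication target to something
# a 4-node cluster can actually satisfy (-w waits for completion)
hadoop fs -setrep -w 3 /data/hadoopdata/tmp/mapred/staging/root/.staging/job_201209061129_2679/job.jar
```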


oChengZi1234 posted on 2013-10-25 10:42:08
If you are feeling bold, you can skip the decommission command entirely and just take the node down. HDFS will detect the missing replicas on its own and re-replicate the data (this happens asynchronously).

dgxl posted on 2013-10-25 10:42:08
We thought of that approach too, but kept it as a last resort.
In the end, deleting those useless MR jars fixed it.


llike90 posted on 2013-10-25 10:42:08
```java
private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
    Configuration job) throws IOException {
  FSDataOutputStream out = FileSystem.create(fs, splitFile,
      new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
  int replication = job.getInt("mapred.submit.replication", 10);
  fs.setReplication(splitFile, (short) replication);
  writeSplitHeader(out);
  return out;
}
```
It is that 10 above. Why was it set to 10? Is there a story behind it? Did the author write this code on a 1000-node cluster and just pick 10 off the cuff?