
Hadoop 2 job fails with "Unable to close file because the last block does not have enough number of replicas"

Posted by wscl1213 on 2016-2-25 16:36:56
Hadoop 2 job error
When running a MapReduce job on Hadoop 2, the reduce phase fails with: Unable to close file because the last block does not have enough number of replicas.

16/02/24 04:43:03 INFO mapreduce.Job:  map 0% reduce 0%
16/02/24 04:43:15 INFO mapreduce.Job:  map 48% reduce 0%
16/02/24 04:43:18 INFO mapreduce.Job:  map 58% reduce 0%
16/02/24 04:43:21 INFO mapreduce.Job:  map 81% reduce 0%
16/02/24 04:43:22 INFO mapreduce.Job:  map 100% reduce 0%
16/02/24 04:43:33 INFO mapreduce.Job:  map 100% reduce 71%
16/02/24 04:43:36 INFO mapreduce.Job:  map 100% reduce 72%
16/02/24 04:43:58 INFO mapreduce.Job: Task Id : attempt_1456260024944_0001_r_000000_0, Status : FAILED
Error: java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
  at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2151)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2119)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:111)
  at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:563)
  at TimeConvert$MultioutRedu.cleanup(TimeConvert.java:53)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/02/24 04:44:00 INFO mapreduce.Job:  map 100% reduce 0%
16/02/24 04:44:15 INFO mapreduce.Job:  map 100% reduce 72%
16/02/24 04:44:38 INFO mapreduce.Job: Task Id : attempt_1456260024944_0001_r_000000_1, Status : FAILED
Error: java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
  at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2151)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2119)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:111)
  at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:563)
  at TimeConvert$MultioutRedu.cleanup(TimeConvert.java:53)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/02/24 04:44:39 INFO mapreduce.Job:  map 100% reduce 0%

Reply from easthome001, posted 2016-2-25 18:48:05


2. Solution

You can increase the number of retries by raising the dfs.client.block.write.locateFollowingBlock.retries parameter. Setting it to 6 makes the sleep intervals between attempts 400ms, 800ms, 1600ms, 3200ms, 6400ms, 12800ms and 25600ms, which means close() can take up to 50.8 seconds to return.

中软 does not recommend modifying this parameter: dfs.client.block.write.locateFollowingBlock.retries is not exposed in the open-source configuration, and tuning it only works around the problem; under heavy CPU load the issue can still occur.

The recommendation is to lower job concurrency or cap CPU usage so as to reduce network traffic, allowing the DataNodes to report block status to the NameNode promptly.
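
For illustration only, here is a minimal client-side sketch of raising this retry count, assuming your distribution actually honors the key (as noted above, it may not be exposed in some builds). The class name is made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical sketch, not from the original post: raise the client-side retry
// count before opening the FileSystem, so close() backs off longer before
// failing with "does not have enough number of replicas".
public class RetryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default is 5; each extra retry adds another doubled sleep (400ms, 800ms, ...).
        conf.setInt("dfs.client.block.write.locateFollowingBlock.retries", 6);
        FileSystem fs = FileSystem.get(conf);
        // ... write job output through fs as usual ...
        fs.close();
    }
}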


Reply from easthome001, posted 2016-2-25 18:48:48

1. Analysis



1. The TSP HDFS client started writing /tsp/nedata/collect/UGW/ugwufdr/20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 at 2015-05-27 18:50:24,232. The block allocated for it was blk_1099105501_25370893:

2015-05-27 18:50:24,232 | INFO  | IPC Server handler 30 on 25000 | BLOCK* allocateBlock: /tsp/nedata/collect/UGW/ugwufdr/20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338. BP-1803470917-172.29.57.33-1428597734132 blk_1099105501_25370893{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-b2d7b7d0-f410-4958-8eba-6deecbca2f87:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-76bd80e7-ad58-49c6-bf2c-03f91caf750f:NORMAL|RBW]]} | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveAllocatedBlock(FSNamesystem.java:3166)



2. After finishing the write, the TSP HDFS client called fsync:

2015-05-27 19:00:22,717 | INFO  | IPC Server handler 22 on 25000 | BLOCK* fsync: /tsp/nedata/collect/UGW/ugwufdr/20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 for DFSClient_NONMAPREDUCE_-120525246_15 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:3805)
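
As a side note, the fsync entry above corresponds to the client syncing its output stream. A minimal sketch of such a client write (hypothetical path, error handling omitted) might look like:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical write-and-sync sketch; the path below is made up.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/tmp/example.pkg.tmp"));
        out.write("some record\n".getBytes(StandardCharsets.UTF_8));
        out.hsync();  // a sync like this is what can show up as the "BLOCK* fsync" NameNode entry
        out.close();  // close() then waits for enough replicas, which is where this thread's error occurs
    }
}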



3. The TSP HDFS client then called close to close the file. On receiving the close request, the NameNode checks the completion state of the last block; the file can only be closed once enough DataNodes have reported the block as complete. This check is performed by the checkFileProgress function, which logged the following:

2015-05-27 19:00:27,603 | INFO  | IPC Server handler 44 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197)

2015-05-27 19:00:28,005 | INFO  | IPC Server handler 45 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197)

2015-05-27 19:00:28,806 | INFO  | IPC Server handler 63 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197)

2015-05-27 19:00:30,408 | INFO  | IPC Server handler 43 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197)

2015-05-27 19:00:33,610 | INFO  | IPC Server handler 37 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197)

2015-05-27 19:00:40,011 | INFO  | IPC Server handler 37 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197)



4. The NameNode printed checkFileProgress multiple times because the TSP HDFS client retried close repeatedly, and each attempt failed because the block state did not yet meet the requirement. The number of client retries is controlled by dfs.client.block.write.locateFollowingBlock.retries, which defaults to 5, so six checkFileProgress entries appear in the NameNode log.
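
To make the retry behavior concrete, here is a simplified, hypothetical sketch of the back-off loop described above (the real logic lives in org.apache.hadoop.hdfs.DFSOutputStream#completeFile, as seen in the stack trace; tryCompleteFile() below is a stand-in for the RPC that triggers checkFileProgress on the NameNode):

import java.io.IOException;

// Simplified sketch of the exponential back-off on close(); not the actual HDFS source.
public class CompleteFileRetrySketch {
    static boolean tryCompleteFile() {
        // Placeholder: pretend the NameNode keeps reporting the last block as COMMITTED,
        // i.e. not enough DataNodes have confirmed it yet.
        return false;
    }

    public static void main(String[] args) throws Exception {
        int retries = 5;     // dfs.client.block.write.locateFollowingBlock.retries (default)
        long sleepMs = 400;  // initial back-off
        while (!tryCompleteFile()) {
            if (retries-- == 0) {
                throw new IOException("Unable to close file because the last block"
                        + " does not have enough number of replicas.");
            }
            Thread.sleep(sleepMs);  // 400, 800, 1600, 3200, 6400 ms, ...
            sleepMs *= 2;           // sleep doubles after every failed attempt
        }
    }
}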



5. However, about half a second later, the DataNode reported that the block had been written successfully:

2015-05-27 19:00:40,608 | INFO  | IPC Server handler 60 on 25000 | BLOCK* addStoredBlock: blockMap updated: 192.168.10.21:25009 is added to blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} size 11837530 | org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.logAddStoredBlock(BlockManager.java:2393)

2015-05-27 19:00:48,297 | INFO  | IPC Server handler 37 on 25000 | BLOCK* addStoredBlock: blockMap updated: 192.168.10.10:25009 is added to blk_1099105501_25370893 size 11837530 | org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.logAddStoredBlock(BlockManager.java:2393)



6. Possible causes of the delayed block report from the DataNode include a network bottleneck or a CPU bottleneck.



7. If close had been called once more at that point, or if the number of close retries were larger, close would have succeeded. It is therefore advisable to increase dfs.client.block.write.locateFollowingBlock.retries appropriately. With the default value of 5, the retry intervals are 400ms, 800ms, 1600ms, 3200ms, 6400ms and 12800ms, so close() takes at most 25.2 seconds to return.
