During this morning's routine inspection of a Hadoop cluster that has been running for over a year, I found that one DataNode was down, with the following error:
2015-06-08 08:52:16,105 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode hadoop-master1/192.168.32.11:8020 using DELETEREPORT_INTERVAL of 300000 msec BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
2015-06-08 08:52:16,105 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1414312971-192.168.32.11-1392479369615 (storage id DS-1944699663-192.168.32.94-50010-1425888569512) service to hadoop-master1/192.168.32.11:8020
I then checked access to the data partitions and found that one of them was inaccessible (we map one disk to one data partition, 10 partitions in total, with no RAID).
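To pinpoint the failed disk, a quick probe of each data directory helps. Below is a minimal Python sketch; the /data1 … /data10 mount points are assumptions for illustration, so substitute the actual paths from dfs.datanode.data.dir:

#!/usr/bin/env python
# Probe each DataNode data directory for basic read/write access.
# The DATA_DIRS paths are hypothetical; use your dfs.datanode.data.dir values.
import os
import tempfile

DATA_DIRS = ["/data%d/dfs" % i for i in range(1, 11)]

def probe(path):
    try:
        # Read check: the directory must be listable.
        os.listdir(path)
        # Write check: we must be able to create a file on the volume.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError as e:
        print("FAILED %s: %s" % (path, e))
        return False

if __name__ == "__main__":
    bad = [d for d in DATA_DIRS if not probe(d)]
    print("%d of %d volumes unhealthy" % (len(bad), len(DATA_DIRS)))

A volume that has dropped out of the filesystem typically shows up here as an I/O error or "Permission denied" rather than a clean listing.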
To let the DataNode tolerate this kind of single-disk failure, add the following configuration to hdfs-site.xml:
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
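The default for dfs.datanode.failed.volumes.tolerated is 0, which means the DataNode shuts itself down as soon as any single data directory fails, exactly what happened here. With the value set to 1, the DataNode keeps serving as long as no more than one volume is down. Note that the value must be smaller than the number of directories configured in dfs.datanode.data.dir (with our 10 partitions, at most 9), otherwise the DataNode will refuse to start, and the DataNode must be restarted for the change to take effect.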