The cause:

The parameter `dfs.datanode.failed.volumes.tolerated` was copied directly from the production configuration, where it is set to 1.

Its meaning: "The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown." In other words, it is the number of failed disks a DataNode will tolerate. In a Hadoop cluster, disks frequently go read-only or fail outright. On startup, the DataNode uses the directories configured under `dfs.datanode.data.dir` to store blocks; if the number of unusable directories exceeds the value configured above, the DataNode fails to start.

In production, `dfs.datanode.data.dir` is configured with 10 disks, so `dfs.datanode.failed.volumes.tolerated` is set to 1, allowing one disk to be bad. The dev machine has only one disk, so `volFailuresTolerated` and `volsConfigured` are both 1, which makes the check in the code fail. See line 182 of `FsDatasetImpl.java` in the Hadoop source:

```java
// The number of volumes required for operation is the total number
// of volumes minus the number of failed volumes we can tolerate.
final int volFailuresTolerated =
  conf.getInt(DFSConfigKeys.DFS_DATANODE_FAILED_VOLUMES_TOLERATED_KEY,
              DFSConfigKeys.DFS_DATANODE_FAILED_VOLUMES_TOLERATED_DEFAULT);

String[] dataDirs = conf.getTrimmedStrings(DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY);

int volsConfigured = (dataDirs == null) ? 0 : dataDirs.length;
int volsFailed = volsConfigured - storage.getNumStorageDirs();
this.validVolsRequired = volsConfigured - volFailuresTolerated;

if (volFailuresTolerated < 0 || volFailuresTolerated >= volsConfigured) {
  throw new DiskErrorException("Invalid volume failure "
      + " config value: " + volFailuresTolerated);
}
if (volsFailed > volFailuresTolerated) {
  throw new DiskErrorException("Too many failed volumes - "
      + "current valid volumes: " + storage.getNumStorageDirs()
      + ", volumes configured: " + volsConfigured
      + ", volumes failed: " + volsFailed
      + ", volume failures tolerated: " + volFailuresTolerated);
}
```
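To see why the single-disk setup trips the first check, the two conditions can be replayed in a minimal standalone sketch (variable names mirror the Hadoop snippet; `DiskErrorException` is replaced by `IllegalStateException` so the example is self-contained):

```java
public class VolumeCheckDemo {
    // Mirrors the two validations from FsDatasetImpl:
    // 1) the tolerated count must be in [0, volsConfigured), and
    // 2) the number of failed volumes must not exceed the tolerated count.
    static void check(int volsConfigured, int numStorageDirs,
                      int volFailuresTolerated) {
        if (volFailuresTolerated < 0 || volFailuresTolerated >= volsConfigured) {
            throw new IllegalStateException(
                "Invalid volume failure config value: " + volFailuresTolerated);
        }
        int volsFailed = volsConfigured - numStorageDirs;
        if (volsFailed > volFailuresTolerated) {
            throw new IllegalStateException(
                "Too many failed volumes: " + volsFailed);
        }
    }

    public static void main(String[] args) {
        // Production: 10 configured dirs, 9 healthy, tolerate 1 -> passes.
        check(10, 9, 1);

        // Dev box: 1 configured dir, tolerate 1 ->
        // volFailuresTolerated >= volsConfigured, so startup fails
        // even though the single disk is healthy.
        try {
            check(1, 1, 1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So on the dev machine the DataNode is rejected by the invalid-config check, not by an actual disk failure.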
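The fix on a single-disk dev machine is therefore to lower the tolerated count below the number of configured directories, e.g. back to 0 (the default). A sketch of the corresponding `hdfs-site.xml` entry (the property name comes from the source; the surrounding file layout is the standard Hadoop one):

```xml
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>0</value>
</property>
```

After changing the value, restart the DataNode for the setting to take effect.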