Hive RCFile合并作业产生重复数据问题

about云腾讯认证空间

问题导读：
hive插入数据产生重复数据的原因是什么？
hive是如何解决的？

前几天有DW用户反馈，在往一张表（RCFile表）中用“insert overwrite table partition(xx) select ...” 插入数据的时候，会产生重复文件。看了下这个作业log，发现map task 000005起了两个task attempt ，第二个attempt是推测执行，并且这两个attemp都在task close函数里面重命名temp文件成正式文件，而不是通过mapreduce框架的两阶段提交协议(two phrase commit protocol)在收到tasktracker发过来的commitTaskAction时再commit task来保证只有一个attemp的结果成为正式结果。

task log中的输出如下：

attempt_201304111550_268224_m_000005_0  
renamed path hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_task_tmp.-ext-10000/hp_cal_month=2013-04/_tmp.000005_0 to hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_tmp.-ext-10000/hp_cal_month=2013-04/000005_0 . File size is 666922  
  
attempt_201304111550_268234_m_000005_1  
renamed path hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_task_tmp.-ext-10000/hp_cal_month=2013-04/_tmp.000005_1 to hdfs://10.2.6.102/tmp/hive-deploy/hive_2013-05-30_10-13-59_124_8643833043783438119/_tmp.-ext-10000/hp_cal_month=2013-04/000005_1 . File size is 666922  
复制代码

其实这条hive语句原本只会起1个job（Launching Job 1 out of 1），当第一个job结束的时候，会起一个conditional task分析每个partition下平均文件的大小，如果小于hive.merge.smallfiles.avgsize（默认为16MB）, 第一个job又是map-only job，并且打开hive.merge.mapfiles（默认为true），则会另外起一个merge-file job来合并小文件，第二个job中用到RCFileMergeMapper就是合并之前生成的小文件的。workaround的方式有两种，一种是关闭推测执行，不过会有可能有一个task比较慢导致瓶颈，另一种是关闭merge file job(set hive.merge.mapfiles=fasle)，从而不使用RCFileMergeMapper，不过这样产生大量的小文件就无法合并。

如果解析出来是要起merge job，会创建一个BlockMergeTask(继承自Task)，执行里面的execute方法，先设置好JobConf相应的参数，比如mapred.mapper.class, hive.rcfile.merge.output.dir等，然后创建一个JobClient并submitJob，map的执行逻辑在RCFileMergeMapper类中，它继承了老的MapRed API中的抽象类MapReduceBase，覆盖了configure和close方法，前面提到的rename操作就是在close方法中，MapRunner类中的run方法会循环调用真正执行mapper的map方法，并在finally调用mapper的close方法

public void close() throws IOException {  
  // close writer  
  if (outWriter == null) {  
    return;  
  }  
  
  outWriter.close();  
  outWriter = null;  
  
  if (!exception) {  
    FileStatus fss = fs.getFileStatus(outPath);  
    LOG.info("renamed path " + outPath + " to " + finalPath  
        + " . File size is " + fss.getLen());  
    if (!fs.rename(outPath, finalPath)) {  
      throw new IOException("Unable to rename output to " + finalPath);  
    }  
  } else {  
    if (!autoDelete) {  
      fs.delete(outPath, true);  
    }  
  }  
}  
复制代码

job执行完成后，是有可能有同一task的不同attempt产生结果文件同时存在的，不过hive显然考虑到了这点，所以在merge作业执行后会调用RCFileMergeMapper.jobClose方法，它会先备份输出目录，然后将数据写入输出目录并调用Utilities.removeTempOrDuplicateFiles方法来删除重复文件，删除的逻辑是从文件名中提取taskid，如果同一个taskid有两个文件，则会将小的那个删除，不过在0.9版本中，RCFileMergeMapper对于目标表是动态分区表情况下不支持，所以还会有duplicated files，打上patch（https://issues.apache.org/jira/browse/HIVE-3149?attachmentOrder=asc）后解决问题

RCFileMergeMapper execute方法finally处理逻辑，源代码中catch住exception后无任何处理，我加了一些stack trace输出和设置return value

finally {  
      try {  
        if (ctxCreated) {  
          ctx.clear();  
        }  
        if (rj != null) {  
          if (returnVal != 0) {  
            rj.killJob();  
          }  
          HadoopJobExecHelper.runningJobKillURIs.remove(rj.getJobID());  
          jobID = rj.getID().toString();  
        }  
        RCFileMergeMapper.jobClose(outputPath, success, job, console, work.getDynPartCtx());  
      } catch (Exception e) {  
        console.printError("RCFile Merger Job Close Error", "\n"  
            + org.apache.hadoop.util.StringUtils.stringifyException(e));  
        e.printStackTrace(System.err);  
        success = false;  
        returnVal = -500;  
      }  
    }  
复制代码

jobClose方法

public static void jobClose(String outputPath, boolean success, JobConf job,  
    LogHelper console, DynamicPartitionCtx dynPartCtx) throws HiveException, IOException {  
  Path outpath = new Path(outputPath);  
  FileSystem fs = outpath.getFileSystem(job);  
  Path backupPath = backupOutputPath(fs, outpath, job);  
  Utilities.mvFileToFinalPath(outputPath, job, success, LOG, dynPartCtx, null);  
  if (backupPath != null) {  
    fs.delete(backupPath, true);  
  }  
}  
复制代码

mvFileToFinalPath方法

public static void mvFileToFinalPath(String specPath, Configuration hconf,  
    boolean success, Log log, DynamicPartitionCtx dpCtx, FileSinkDesc conf) throws IOException,  
    HiveException {  
  
  FileSystem fs = (new Path(specPath)).getFileSystem(hconf);  
  Path tmpPath = Utilities.toTempPath(specPath);  
  Path taskTmpPath = Utilities.toTaskTempPath(specPath);  
  Path intermediatePath = new Path(tmpPath.getParent(), tmpPath.getName()  
      + ".intermediate");  
  Path finalPath = new Path(specPath);  
  if (success) {  
    if (fs.exists(tmpPath)) {  
      // Step1: rename tmp output folder to intermediate path. After this  
      // point, updates from speculative tasks still writing to tmpPath  
      // will not appear in finalPath.  
      log.info("Moving tmp dir: " + tmpPath + " to: " + intermediatePath);  
      Utilities.rename(fs, tmpPath, intermediatePath);  
      // Step2: remove any tmp file or double-committed output files  
      ArrayList<String> emptyBuckets =  
          Utilities.removeTempOrDuplicateFiles(fs, intermediatePath, dpCtx);  
      // create empty buckets if necessary  
      if (emptyBuckets.size() > 0) {  
        createEmptyBuckets(hconf, emptyBuckets, conf);  
      }  
  
      // Step3: move to the file destination  
      log.info("Moving tmp dir: " + intermediatePath + " to: " + finalPath);  
      Utilities.renameOrMoveFiles(fs, intermediatePath, finalPath);  
    }  
  } else {  
    fs.delete(tmpPath, true);  
  }  
  fs.delete(taskTmpPath, true);  
}  
复制代码

removeTempOrDuplicateFiles(FileSystem fs, Path path, DynamicPartitionCtx dpCtx)，动态分区和非动态分区表不同处理逻辑

/** 
   * Remove all temporary files and duplicate (double-committed) files from a given directory. 
   * 
   * @return a list of path names corresponding to should-be-created empty buckets. 
   */  
  public static ArrayList<String> removeTempOrDuplicateFiles(FileSystem fs, Path path,  
      DynamicPartitionCtx dpCtx) throws IOException {  
    if (path == null) {  
      return null;  
    }  
  
    ArrayList<String> result = new ArrayList<String>();  
    if (dpCtx != null) {  
      FileStatus parts[] = getFileStatusRecurse(path, dpCtx.getNumDPCols(), fs);  
      HashMap<String, FileStatus> taskIDToFile = null;  
  
      for (int i = 0; i < parts.length; ++i) {  
        assert parts[i].isDir() : "dynamic partition " + parts[i].getPath()  
            + " is not a direcgtory";  
        FileStatus[] items = fs.listStatus(parts[i].getPath());  
  
        // remove empty directory since DP insert should not generate empty partitions.  
        // empty directories could be generated by crashed Task/ScriptOperator  
        if (items.length == 0) {  
          if (!fs.delete(parts[i].getPath(), true)) {  
            LOG.error("Cannot delete empty directory " + parts[i].getPath());  
            throw new IOException("Cannot delete empty directory " + parts[i].getPath());  
          }  
        }  
  
        taskIDToFile = removeTempOrDuplicateFiles(items, fs);  
        // if the table is bucketed and enforce bucketing, we should check and generate all buckets  
        if (dpCtx.getNumBuckets() > 0 && taskIDToFile != null) {  
          // refresh the file list  
          items = fs.listStatus(parts[i].getPath());  
          // get the missing buckets and generate empty buckets  
          String taskID1 = taskIDToFile.keySet().iterator().next();  
          Path bucketPath = taskIDToFile.values().iterator().next().getPath();  
          for (int j = 0; j < dpCtx.getNumBuckets(); ++j) {  
            String taskID2 = replaceTaskId(taskID1, j);  
            if (!taskIDToFile.containsKey(taskID2)) {  
              // create empty bucket, file name should be derived from taskID2  
              String path2 = replaceTaskIdFromFilename(bucketPath.toUri().getPath().toString(), j);  
              result.add(path2);  
            }  
          }  
        }  
      }  
    } else {  
      FileStatus[] items = fs.listStatus(path);  
      removeTempOrDuplicateFiles(items, fs);  
    }  
    return result;  
  }
复制代码

removeTempOrDuplicateFiles(FileStatus[] items, FileSystem fs)，对于每个目录下相同taskid不同attemptid的文件进行去重

public static HashMap<String, FileStatus> removeTempOrDuplicateFiles(FileStatus[] items,  
    FileSystem fs) throws IOException {  
  
  if (items == null || fs == null) {  
    return null;  
  }  
  
  HashMap<String, FileStatus> taskIdToFile = new HashMap<String, FileStatus>();  
  
  for (FileStatus one : items) {  
    if (isTempPath(one)) {  
      if (!fs.delete(one.getPath(), true)) {  
        throw new IOException("Unable to delete tmp file: " + one.getPath());  
      }  
    } else {  
      String taskId = getTaskIdFromFilename(one.getPath().getName());  
      FileStatus otherFile = taskIdToFile.get(taskId);  
      if (otherFile == null) {  
        taskIdToFile.put(taskId, one);  
      } else {  
        // Compare the file sizes of all the attempt files for the same task, the largest win  
        // any attempt files could contain partial results (due to task failures or  
        // speculative runs), but the largest should be the correct one since the result  
        // of a successful run should never be smaller than a failed/speculative run.  
        FileStatus toDelete = null;  
        if (otherFile.getLen() >= one.getLen()) {  
          toDelete = one;  
        } else {  
          toDelete = otherFile;  
          taskIdToFile.put(taskId, one);  
        }  
        long len1 = toDelete.getLen();  
        long len2 = taskIdToFile.get(taskId).getLen();  
        if (!fs.delete(toDelete.getPath(), true)) {  
          throw new IOException("Unable to delete duplicate file: " + toDelete.getPath()  
              + ". Existing file: " + taskIdToFile.get(taskId).getPath());  
        } else {  
          LOG.info("Duplicate taskid file removed: " + toDelete.getPath() + " with length "  
              + len1 + ". Existing file: " + taskIdToFile.get(taskId).getPath() + " with length "  
              + len2);  
        }  
      }  
    }  
  }  
  return taskIdToFile;  
}  
复制代码

链接http://blog.csdn.net/lalaguozhe/article/details/9095679

图文精华

Hive RCFile合并作业产生重复数据问题

活跃会员

热心会员

推广达人

优秀版主

推荐 /2