Hadoop Job Submission Script Analysis (2)

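Last edited by howtodown on 2014-2-21 17:11
This post builds on Hadoop Job Submission Script Analysis (1).
Questions to keep in mind while reading:
1. The submission example below passes in a hard-coded Hadoop home path. Could that be made user-selectable?
2. When submitting a job, where does conf need to be placed?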

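With the Hadoop job submission flow now reasonably clear, we can start writing code that mimics it.

The first step is to add Hadoop's dependency libraries and configuration files to the classpath. The usual approach is to collect every file or directory that should go on the classpath in a container, then use that container later as the URL search path of a class loader.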

  /**
   * Add a directory or file to classpath.
   * 
   * @param component path of the directory or file to add
   */
  public static void addClasspath(String component) {
    if ((component != null) && (component.length() > 0)) {
      try {
        File f = new File(component);
        if (f.exists()) {
          // File.toURL() is deprecated and does not escape special characters.
          URL key = f.getCanonicalFile().toURI().toURL();
          if (!classPath.contains(key)) {
            classPath.add(key);
          }
        }
      } catch (IOException e) {
        // Skip entries whose path cannot be resolved.
      }
    }
  }

The classPath variable above is the container we declared to hold the classpath components.

private static ArrayList<URL> classPath = new ArrayList<URL>();

Since we need to add every jar in certain directories, we also need a method that walks a directory and adds the files in it.

/**
   * Add all jars in directory to classpath, sub-directory is excluded.
   * 
   * @param dirPath
   */
  public static void addJarsInDir(String dirPath) {
    File dir = new File(dirPath);
    if (!dir.exists()) {
      return;
    }
    File[] files = dir.listFiles();
    if (files == null) {
      return;
    }
    for (int i = 0; i < files.length; i++) {
      if (files[i].isDirectory()) {
        continue;
      } else {
        addClasspath(files[i].getAbsolutePath());
      }
    }
  }

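For simplicity, this method uses no filter: it takes every file in the directory, whatever its extension, and ignores sub-directories, handling only the top level.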
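
If taking every file turns out to be too broad, a filter can restrict the scan to jar files. The following is a minimal sketch, not the article's EJob code: the class name JarDirScanner and the method listJars are made up for illustration, and it returns the matches instead of adding them to classPath so that it runs on its own.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class JarDirScanner {
    /** Returns the .jar files directly inside dirPath; sub-directories are skipped. */
    static List<File> listJars(String dirPath) {
        List<File> jars = new ArrayList<>();
        File dir = new File(dirPath);
        // The two-argument lambda is a FilenameFilter; listFiles returns null
        // when dirPath is not a readable directory.
        File[] files = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".jar"));
        if (files == null) {
            return jars;
        }
        for (File f : files) {
            // A directory that happens to be named x.jar is not a jar file.
            if (f.isFile()) {
                jars.add(f);
            }
        }
        return jars;
    }
}
```

Each returned file could then be handed to addClasspath(f.getAbsolutePath()).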

With the basic methods in place, the next step is to follow what the bin/hadoop script does and add the corresponding classpath entries.

/**
   * Add default classpath listed in bin/hadoop bash.
   * 
   * @param hadoopHome
   */
  public static void addDefaultClasspath(String hadoopHome) {
    // Classpath initially contains conf dir.
    addClasspath(hadoopHome + "/conf");

    // For developers, add Hadoop classes to classpath.
    addClasspath(hadoopHome + "/build/classes");
    if (new File(hadoopHome + "/build/webapps").exists()) {
      addClasspath(hadoopHome + "/build");
    }
    addClasspath(hadoopHome + "/build/test/classes");
    addClasspath(hadoopHome + "/build/tools");

    // For releases, add core hadoop jar & webapps to classpath.
    if (new File(hadoopHome + "/webapps").exists()) {
      addClasspath(hadoopHome);
    }
    addJarsInDir(hadoopHome);
    addJarsInDir(hadoopHome + "/build");

    // Add libs to classpath.
    addJarsInDir(hadoopHome + "/lib");
    addJarsInDir(hadoopHome + "/lib/jsp-2.1");
    addJarsInDir(hadoopHome + "/build/ivy/lib/Hadoop/common");
  }

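At this point everything that needs to be on the classpath has been added (third-party libraries are not included; they can be added through the tmpjars property in the conf). The next step is to call the RunJar class. For convenience, this article extracts two methods from RunJar, drops some Hadoop library dependencies that are not needed, and merges the result into a class called EJob. The main changes are that the directory used for unpacking the jar is taken from "java.io.tmpdir" instead of "hadoop.tmp.dir", and that the fullyDelete method has been pulled out.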
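
The two changes mentioned, unpacking under java.io.tmpdir and a standalone fullyDelete, might look roughly like this. It is a sketch under assumptions, not the code shipped in EJob: the class name TempDirs and the method createWorkDir are hypothetical, and only fullyDelete takes its name from the text.

```java
import java.io.File;
import java.io.IOException;

final class TempDirs {
    /** Creates a fresh, uniquely named work directory under java.io.tmpdir. */
    static File createWorkDir(String prefix) throws IOException {
        File tmpRoot = new File(System.getProperty("java.io.tmpdir"));
        File workDir = File.createTempFile(prefix, "", tmpRoot);
        // createTempFile makes a file; swap it for a directory of the same name.
        if (!workDir.delete() || !workDir.mkdirs()) {
            throw new IOException("Could not create " + workDir);
        }
        return workDir;
    }

    /** Recursively deletes dir and everything below it; returns true on success. */
    static boolean fullyDelete(File dir) {
        File[] contents = dir.listFiles();
        if (contents != null) {
            for (File f : contents) {
                if (f.isDirectory()) {
                    if (!fullyDelete(f)) {
                        return false;
                    }
                } else if (!f.delete()) {
                    return false;
                }
            }
        }
        return dir.delete();
    }
}
```

A caller would typically unpack into the directory returned by createWorkDir and register a shutdown hook that calls fullyDelete on it.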

Submitting a Hadoop job with this class is simple. Here is an example:

    args = new String[4];
    args[0] = "E:\\Research\\Hadoop\\hadoop-0.20.1+152\\hadoop-0.20.1+152-examples.jar";
    args[1] = "pi";
    args[2] = "2";
    args[3] = "100";
    // Pass in the Hadoop home path; the matching classpath entries are added automatically.
    EJob.addDefaultClasspath("E:\\Research\\Hadoop\\hadoop-0.20.1+152");
    EJob.runJar(args);

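The example above runs the pi example from the official Hadoop examples jar. The arguments are passed the same way as for the command bin/hadoop jar *.jar mainclass args, minus the leading bin/hadoop jar, since we no longer need that script to submit jobs. Create a new project, add a class, paste the code above into main, then Run as Java Application. Watch the console and you will see that the job has been submitted to Hadoop.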
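
On the question of making the Hadoop home path selectable rather than hard-coded, one possible approach is to take it from a program argument and fall back to a system property or the HADOOP_HOME environment variable. The sketch below assumes this fallback order; the class HadoopHomeResolver and the property name hadoop.home are hypothetical choices for illustration.

```java
final class HadoopHomeResolver {
    /**
     * Picks the first non-blank candidate, in priority order:
     * explicit argument, then system property, then environment variable.
     */
    static String resolve(String cliArg, String sysProp, String envVar) {
        for (String candidate : new String[] { cliArg, sysProp, envVar }) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate.trim();
            }
        }
        throw new IllegalStateException(
                "Hadoop home not set: pass it as an argument, a system property, or HADOOP_HOME");
    }

    public static void main(String[] args) {
        String home = resolve(
                args.length > 0 ? args[0] : null,
                System.getProperty("hadoop.home"),   // hypothetical property name
                System.getenv("HADOOP_HOME"));
        System.out.println("Using Hadoop home: " + home);
        // From here the article's example would continue with
        // EJob.addDefaultClasspath(home) and EJob.runJar(...).
    }
}
```

Keeping resolve free of direct environment access makes the priority order easy to test; only main touches the real system property and environment.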

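Seeing is believing, so here is a sample run of mine (Eclipse on Windows, the Hadoop cluster on Linux, with the same configuration files as the cluster).

Below is the job information as shown in Cloudera Desktop (its times are in UTC).

With this approach we could build a web application similar to Cloudera Desktop that accepts a jar uploaded by the user, submits it to Hadoop from the request handler (Action), and returns the result to the user.

For reasons of space, and because RunJar was covered earlier, this article does not reproduce the RunJar code. The example project is available for download, though, and you can optimize it and add features on top of it. Since most of it is Hadoop code, it is released under the Apache License.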

Download: jobutil.rar (9.96 KB), uploaded 2014-2-21 16:19

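That concludes submitting Hadoop jobs from Java. But can we go one step further? So far we can only submit MR programs that are already packaged; we cannot yet do what the Hadoop Eclipse Plugin does and Run on Hadoop directly against a class containing a Mapper and Reducer. And why does a job submitted by running such a class with Run as Java Application execute locally?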

----------------------------------------------------------------------------------------------------------------------------------------------------

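What we have analyzed so far is really only the prelude to Hadoop job submission. The actual submission code lives in the main method of the MR program, which RunJar invokes dynamically at the end. What we want now is to go a step beyond RunJar and make job submission possible straight from the code being developed, the way the Hadoop Eclipse Plugin lets you Run on Hadoop on an MR class containing a Mapper and Reducer.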
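
The dynamic call to main can be illustrated without Hadoop on the classpath. The sketch below loads a class by name through a given class loader and invokes its static main via reflection, which is the general technique; MainInvoker and FakeJob are stand-ins invented for this example, not Hadoop classes.

```java
import java.lang.reflect.Method;

final class MainInvoker {
    /** Records what the stand-in program received, so the call can be observed. */
    static String received;

    /** Hypothetical stand-in for an MR program's entry class. */
    public static final class FakeJob {
        public static void main(String[] args) {
            received = String.join(" ", args);
        }
    }

    /** Loads className through loader and calls its static main(String[]). */
    static void invokeMain(String className, ClassLoader loader, String[] args)
            throws Exception {
        Class<?> mainClass = Class.forName(className, true, loader);
        Method main = mainClass.getMethod("main", String[].class);
        // Wrap args in an Object[] so it is passed as a single String[] parameter.
        main.invoke(null, new Object[] { args });
    }

    public static void main(String[] args) throws Exception {
        invokeMain(FakeJob.class.getName(), MainInvoker.class.getClassLoader(),
                new String[] { "pi", "2", "100" });
        System.out.println("FakeJob.main received: " + received);
    }
}
```

In the real case the loader would be one whose search path includes the unpacked job jar, and the class name would come from the jar's manifest or the first argument.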

Generally, every MR program contains a block of submission code like the following, taken here from WordCount:

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

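The first thing is to build a Configuration object and parse the arguments. Next comes the Job object used for submission, on which we set the job jar, the Mapper and Reducer classes, the output key and value classes, and the job's input and output paths. Finally the job is submitted and we wait for it to finish. These are only the basic settings; many more are supported, and the API documentation covers them in detail.

Code is usually analyzed from the top, step by step, but our focus is on what happens during submission. So for now we set aside how the earlier settings affect the job and jump straight to the submission step, coming back to the earlier code when a question requires it.

Calling job.waitForCompletion internally calls submit to do the submission. If the argument is true, job progress is printed as it runs; otherwise the call just waits for the job to finish. Inside submit there is one more layer: it uses submitJobInternal of the job's internal jobClient object, and this is where the real work starts. The first thing it does is obtain a jobId through the jobSubmitClient object. The class behind jobSubmitClient is one of the implementations of JobSubmissionProtocol (there are currently two, JobTracker and LocalJobRunner), so jobSubmitClient must correspond to either JobTracker or LocalJobRunner.

That suggests an answer: whether a job goes up to the JobTracker or runs locally probably depends on which class jobSubmitClient was initialized with. Looking a little further ahead, submitJobInternal ends by calling jobSubmitClient.submitJob(jobId), and a glance at the submitJob implementations in JobTracker and LocalJobRunner confirms it. So let us go back and see how jobSubmitClient is initialized. The initialization statement is in JobClient's init: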

    String tracker = conf.get("mapred.job.tracker", "local");
    if ("local".equals(tracker)) {
      this.jobSubmitClient = new LocalJobRunner(conf);
    } else {
      this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
    }

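So it depends on the mapred.job.tracker property in conf. If it is not set, the default value is local and jobSubmitClient is assigned a LocalJobRunner instance. During development we usually reference only the libraries in lib and not the configuration files in the conf directory, which explains why a job started with Run as Java Application runs locally rather than on the Hadoop cluster. So if we add the conf directory to the classpath, can we Run on Hadoop? It is too early to say, so let us keep going. (Once you have added the conf directory, try a submission: it fails with an error that makes it obvious what is still missing. I will keep that as a cliffhanger for now.)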
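
The effect of a missing conf directory can be modelled without Hadoop. In this sketch java.util.Properties stands in for Configuration, and the interface and two classes are simplified stand-ins for JobSubmissionProtocol, LocalJobRunner and the JobTracker proxy; all of these names and the tracker address are assumptions for illustration.

```java
import java.util.Properties;

final class SubmitTargetModel {
    /** Stand-in for JobSubmissionProtocol; only what this example needs. */
    interface SubmissionProtocol {
        String describe();
    }

    static final class LocalRunner implements SubmissionProtocol {
        public String describe() {
            return "local";
        }
    }

    static final class RemoteTracker implements SubmissionProtocol {
        private final String address;

        RemoteTracker(String address) {
            this.address = address;
        }

        public String describe() {
            return "cluster at " + address;
        }
    }

    /** Same branch as in JobClient.init: a missing property means "local". */
    static SubmissionProtocol choose(Properties conf) {
        String tracker = conf.getProperty("mapred.job.tracker", "local");
        if ("local".equals(tracker)) {
            return new LocalRunner();
        }
        return new RemoteTracker(tracker);
    }

    public static void main(String[] args) {
        Properties withoutConfDir = new Properties();  // nothing loaded from conf
        Properties withConfDir = new Properties();     // value as if read from conf
        withConfDir.setProperty("mapred.job.tracker", "master:9001"); // made-up address
        System.out.println(choose(withoutConfDir).describe());
        System.out.println(choose(withConfDir).describe());
    }
}
```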

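Once the jobId is obtained, the submission directory submitJobDir is built by appending the jobId to SystemDir. SystemDir comes from JobClient's getSystemDir method, and it matters a great deal when the fs object is built because it determines the type of fs returned. The configureCommandLineOptions method that follows mainly uploads the third-party libraries and files the job depends on to fs, sets up the classpath mapping or symlinks, and sets a few parameters. That is detail work we will not go through here. Two places in it matter to us. The first is: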

FileSystem fs = getFs();

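It looks simple: one statement that obtains a FileSystem instance. Behind it, though, is a fairly convoluted path. Because Hadoop abstracts the file system, the type of the fs instance obtained here decides whether you operate on HDFS or on the local file system. Let us step in.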

  public synchronized FileSystem getFs() throws IOException {
    if (this.fs == null) {
      Path sysDir = getSystemDir();
      this.fs = sysDir.getFileSystem(getConf());
    }
    return fs;
  }

As you can see, fs is returned by sysDir's getFileSystem. Going deeper, and listing only the key statements to save space:

    FileSystem.get(this.toUri(), conf);
        ↓
    CACHE.get(uri, conf);
        ↓
    fs = createFileSystem(uri, conf);
        ↓
    Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null);
    if (clazz == null) {
      throw new IOException("No FileSystem for scheme: " + uri.getScheme());
    }
    FileSystem fs = (FileSystem)ReflectionUtils.newInstance(clazz, conf);
    fs.initialize(uri, conf);
    return fs;

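conf again, so it clearly needs to be tracked throughout. Java reflection is used here to create the appropriate class instance dynamically. Which class is obtained depends closely on uri.getScheme(), and the uri is built from the sysDir seen a moment ago, whose value is ultimately determined by the jobSubmitClient instance. If jobSubmitClient is a JobTracker instance, the scheme is hdfs; if it is a LocalJobRunner instance, the scheme is file. In core-default.xml you can find the following: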

<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
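
The lookup from URI scheme to implementation class can be sketched with a plain map standing in for Configuration. This is a simplified model, not Hadoop's code: it returns the configured class name instead of instantiating it by reflection, and the class FsSchemeLookup, the host name and the paths are invented for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

final class FsSchemeLookup {
    /** Returns the implementation class name configured for the URI's scheme. */
    static String implFor(URI uri, Map<String, String> conf) throws IOException {
        String impl = conf.get("fs." + uri.getScheme() + ".impl");
        if (impl == null) {
            throw new IOException("No FileSystem for scheme: " + uri.getScheme());
        }
        return impl;
    }

    /** The two entries quoted from core-default.xml. */
    static Map<String, String> defaults() {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");
        conf.put("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        return conf;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> conf = defaults();
        // Host, port and paths below are made up for the illustration.
        System.out.println(implFor(URI.create("hdfs://namenode:8020/tmp/system"), conf));
        System.out.println(implFor(URI.create("file:///tmp/mapred/system"), conf));
    }
}
```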

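So in the submission code shown earlier, a great deal is already decided when the Job instance is initialized, and it is decided by the configuration files in the conf directory. Configuration loads classes and resource files through the current thread's context class loader, so the first requirement for Run on Hadoop is that the conf directory be on the search path of the class loader Configuration uses, that is, the current thread's context class loader.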
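
To see what being on the search path of the context class loader means in practice, the following sketch creates a temporary directory with one file in it and checks whether the context class loader can find that file before and after the directory is added. It uses only the JDK; ContextLoaderDemo and the file name demo-site.xml are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class ContextLoaderDemo {
    /** True if the current thread's context class loader can see the resource. */
    static boolean visible(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        return cl != null && cl.getResource(resourceName) != null;
    }

    /** Creates a throwaway "conf" directory holding one marker file. */
    static File makeConfDir(String fileName) throws IOException {
        File dir = Files.createTempDirectory("conf-demo").toFile();
        Files.write(new File(dir, fileName).toPath(),
                "<configuration/>".getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    /** Installs a new context class loader that also searches dir. */
    static void addToContextLoader(File dir) throws IOException {
        ClassLoader parent = Thread.currentThread().getContextClassLoader();
        URL url = dir.getCanonicalFile().toURI().toURL();
        Thread.currentThread().setContextClassLoader(
                new URLClassLoader(new URL[] { url }, parent));
    }

    public static void main(String[] args) throws IOException {
        String name = "demo-site.xml"; // hypothetical name, not a real Hadoop config file
        File confDir = makeConfDir(name);
        System.out.println("before: " + visible(name));
        addToContextLoader(confDir);
        System.out.println("after:  " + visible(name));
    }
}
```

A class that looks up its resources through the context class loader, as Configuration is described as doing, would find the file only after the second step.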

The second place to note is:

    String originalJarPath = job.getJar();
    if (originalJarPath != null) {           // copy jar to JobTracker's fs
      // use jar name if job is not named.
      if ("".equals(job.getJobName())){
        job.setJobName(new Path(originalJarPath).getName());
      }
      job.setJar(submitJarFile.toString());
      fs.copyFromLocalFile(new Path(originalJarPath), submitJarFile);
      fs.setReplication(submitJarFile, replication);
      fs.setPermission(submitJarFile, new FsPermission(JOB_FILE_PERMISSION));
    } else {
      LOG.warn("No job jar file set.  User classes may not be found. "+
               "See JobConf(Class) or JobConf#setJar(String).");
    }

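When submitting to Hadoop, the client has to package the job as a jar and copy it to the submitJarFile path on fs. So to Run on Hadoop, we have to package the job's class files into a jar ourselves before submitting. In Eclipse this is fairly easy. Assuming automatic build is enabled, we can add code at the start of the program that packs the class files in the bin directory into a jar, then carry on with the usual steps.

After configureCommandLineOptions, submitJobInternal checks whether the output directory already exists and throws an exception if it does. It then splits the job's input data and derives the number of map tasks from the number of splits. Finally it writes the job configuration to submitJobFile and calls jobSubmitClient.submitJob(jobId) to actually submit the job.

That more or less completes the analysis of Hadoop job submission. Some parts were long-winded and others only touched on, but the overall process and the important points should be clear. In what follows we add a few features to the earlier jobUtil to support Run on Hadoop; the main addition is a method that packages a jar.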

---------------------------------------------------------------------------------------------------------------------------------------------------

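From the analysis above we know that whether a Hadoop job is submitted to the cluster or run locally depends closely on the parameters in the configuration files in the conf directory. Many other classes depend on conf as well, so when submitting a job, make sure conf is on your classpath.

Because Configuration uses the current thread's context class loader to load resources and files, we load things dynamically here: first add the required dependency libraries and resources, then build a URLClassLoader and install it as the current thread's context class loader.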

public static ClassLoader getClassLoader() {
        ClassLoader parent = Thread.currentThread().getContextClassLoader();
        if (parent == null) {
            parent = EJob.class.getClassLoader();
        }
        if (parent == null) {
            parent = ClassLoader.getSystemClassLoader();
        }
        return new URLClassLoader(classPath.toArray(new URL[0]), parent);
    }

The code is straightforward. It is called like this:

    EJob.addClasspath("/usr/lib/hadoop-0.20/conf");
    ClassLoader classLoader = EJob.getClassLoader();
    Thread.currentThread().setContextClassLoader(classLoader);

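With the class loader set, one step remains: packaging the jar, that is, having the project pack its own classes into a jar. I use the standard Eclipse project layout as the example, so what gets packaged is the class files in the bin directory.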

public static File createTempJar(String root) throws IOException {
        if (!new File(root).exists()) {
            return null;
        }
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
        final File jarFile = File.createTempFile("EJob-", ".jar", new File(System
                .getProperty("java.io.tmpdir")));

        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                jarFile.delete();
            }
        });

        JarOutputStream out = new JarOutputStream(new FileOutputStream(jarFile),
                manifest);
        createTempJarInner(out, new File(root), "");
        out.flush();
        out.close();
        return jarFile;
    }

    private static void createTempJarInner(JarOutputStream out, File f,
            String base) throws IOException {
        if (f.isDirectory()) {
            File[] fl = f.listFiles();
            if (base.length() > 0) {
                base = base + "/";
            }
            for (int i = 0; i < fl.length; i++) {
                createTempJarInner(out, fl[i], base + fl[i].getName());
            }
        } else {
            out.putNextEntry(new JarEntry(base));
            FileInputStream in = new FileInputStream(f);
            byte[] buffer = new byte[1024];
            int n = in.read(buffer);
            while (n != -1) {
                out.write(buffer, 0, n);
                n = in.read(buffer);
            }
            in.close();
        }
    }

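The public entry point is createTempJar. It takes the root path of the directory to package and handles sub-directories by recursion, adding the directory structure and files to the jar in turn. It is just basic stream handling; the less familiar pieces are Manifest and JarOutputStream, which the API documentation explains.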
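
As a variation on the same idea, the packaging can be written with try-with-resources so the streams are closed even if an exception is thrown midway, and the result can be checked by listing the jar's entries. This is a sketch assuming Java 7 or later; JarPacker and its method names are illustrative, not part of EJob.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class JarPacker {
    /** Packs every file under the directory root into jarFile, keeping relative paths. */
    static void pack(File root, File jarFile) throws IOException {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
        try (JarOutputStream out =
                new JarOutputStream(new FileOutputStream(jarFile), manifest)) {
            addRecursively(out, root, "");
        }
    }

    private static void addRecursively(JarOutputStream out, File f, String entryName)
            throws IOException {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children == null) {
                return; // unreadable directory
            }
            String prefix = entryName.isEmpty() ? "" : entryName + "/";
            for (File child : children) {
                addRecursively(out, child, prefix + child.getName());
            }
        } else {
            out.putNextEntry(new JarEntry(entryName));
            Files.copy(f.toPath(), out); // streams the file's bytes into the entry
            out.closeEntry();
        }
    }

    /** Returns the sorted entry names of a jar, for a quick sanity check. */
    static List<String> entryNames(File jarFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarFile)) {
            for (Enumeration<JarEntry> e = jar.entries(); e.hasMoreElements();) {
                names.add(e.nextElement().getName());
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

Entry names use forward slashes regardless of platform, which is what the jar format expects.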

Everything is now in place, so let us try it out, again with WordCount:

// Add these statements. XXX
        File jarFile = EJob.createTempJar("bin");
        EJob.addClasspath("/usr/lib/hadoop-0.20/conf");
        ClassLoader classLoader = EJob.getClassLoader();
        Thread.currentThread().setContextClassLoader(classLoader);

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountTest.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);

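Run as Java Application, and the "No job jar file set..." message appears, so the statement job.setJarByClass(WordCountTest.class) did not manage to set the job jar. Why?

The method uses the class loader of WordCountTest.class to find the jar containing that class and sets it as the job jar. But our job jar is only created while the program is running, and the class loader of WordCountTest.class is the AppClassLoader, whose search path we cannot change once the program has started. So setJarByClass cannot set the job jar here. We have to set it directly with setJar in JobConf, like this: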

((JobConf) job.getConfiguration()).setJar(jarFile.toString());

Now modify the example above by adding this statement.

Job job = new Job(conf, "word count");
// And add this statement. XXX
((JobConf) job.getConfiguration()).setJar(jarFile.toString());

Run as Java Application again, and this time it works.
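
The reason setJarByClass came up empty can be checked directly: a class loaded from a directory such as bin has a file: resource URL, and only a class loaded from a jar has a jar: URL from which a jar path can be extracted. The sketch below shows that check in isolation. It is an approximation of the idea, not Hadoop's implementation, and JarLocator and its methods are hypothetical names.

```java
import java.net.URL;

final class JarLocator {
    /**
     * Returns the path of the jar that loader would load resourceName from,
     * or null when the resource is missing or does not come from a jar.
     * The path is returned as it appears in the URL (still URL-encoded).
     */
    static String containingJar(ClassLoader loader, String resourceName) {
        URL url = loader.getResource(resourceName);
        if (url == null || !"jar".equals(url.getProtocol())) {
            return null;
        }
        // A jar URL looks like jar:file:/path/to/x.jar!/pkg/Name.class,
        // so getPath() yields file:/path/to/x.jar!/pkg/Name.class
        String path = url.getPath();
        if (path.startsWith("file:")) {
            path = path.substring("file:".length());
        }
        int bang = path.indexOf('!');
        return bang >= 0 ? path.substring(0, bang) : path;
    }

    /** Resource name under which a class's bytecode is looked up. */
    static String classResource(Class<?> cls) {
        return cls.getName().replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        String jar = containingJar(JarLocator.class.getClassLoader(),
                classResource(JarLocator.class));
        // When launched from an IDE output folder rather than a jar, this is null.
        System.out.println("containing jar: " + jar);
    }
}
```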

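This way of doing Run on Hadoop is simple to use and has good compatibility, so it is worth a try. :)

Due to time constraints this example was only tested on Ubuntu in pseudo-distributed mode, but in theory it should also work on a real distributed cluster.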

Download: jobutil_2.rar (16.8 KB)
