about云 Source Code Analysis: Hadoop 2.7.1 MapReduce (Using wordcount as the Example)

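Guiding questions
1. What do you think the GenericOptionsParser class is for?
2. What do you think the Options class is for?
3. When a job is submitted with several paths, how are the input paths told apart from the output path?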

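Hadoop 2.7.1 has been released, and the source of its bundled examples shows that wordcount is now written differently: it no longer uses ToolRunner and takes a new approach instead.
To get the source code, see:
A step-by-step guide to obtaining the Hadoop 2.X source and attaching it in Eclipse
http://www.aboutyun.com/thread-14244-1-1.html

Let's look at the example: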

package wordcount;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class wordcount {

  // Mapper: splits each input line into tokens and emits (token, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Apply the generic Hadoop options to conf; keep whatever is left over.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(wordcount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Every argument except the last is an input path.
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    // The last argument is the output path.
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
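In earlier versions we used ToolRunner, but it is not used here. So which class takes its place? The answer is GenericOptionsParser. What does this class do?

GenericOptionsParser is the basic class in the Hadoop framework for parsing command-line arguments.
If you have read the source of earlier versions, GenericOptionsParser will look familiar, because ToolRunner uses it as well.
By dropping ToolRunner in this MapReduce example, Apache has actually made the code simpler for us to write.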

#######################################################

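Now that we know what GenericOptionsParser does, look at its constructor. In its fullest form it takes three parameters:
Configuration: the configuration object; the generic options parsed from the command line are applied to it.
Options: a collection of option objects (from Apache Commons CLI) that describes the command-line parameters the application may accept.
String[] args: the args passed in from main(String[] args).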
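To make the division of labour concrete, here is a toy sketch of the idea: leading generic options are folded into a configuration, and everything after them is handed back to the application. This is not Hadoop's implementation. The class GenericOptionsSketch, its fields, and the restriction to the "-D key=value" form are all assumptions made for illustration; the real parser recognises more options and is built on Apache Commons CLI.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, NOT org.apache.hadoop.util.GenericOptionsParser.
class GenericOptionsSketch {
    // Plays the role of the Configuration object.
    final Map<String, String> conf = new HashMap<>();
    // Plays the role of what getRemainingArgs() returns.
    final String[] remaining;

    GenericOptionsSketch(String[] args) {
        int i = 0;
        // Consume leading "-D key=value" pairs; stop at the first other token.
        while (i + 1 < args.length && args[i].equals("-D")) {
            String[] kv = args[i + 1].split("=", 2);
            conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            i += 2;
        }
        // Whatever is left belongs to the application (here: the paths).
        remaining = Arrays.copyOfRange(args, i, args.length);
    }

    public static void main(String[] args) {
        GenericOptionsSketch p = new GenericOptionsSketch(
            new String[] {"-D", "mapreduce.job.reduces=2", "in", "out"});
        System.out.println(p.conf);
        System.out.println(Arrays.toString(p.remaining));
    }
}
```

In the sketch the option pair ends up in conf and only "in" and "out" remain, which mirrors why wordcount reads its paths from otherArgs rather than from args directly.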

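In the Hadoop 2.7.1 wordcount the call is:
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

The source of getRemainingArgs is:
public String[] getRemainingArgs() {
  return (commandLine == null) ? new String[]{} : commandLine.getArgs();
}

In other words, if commandLine is null it returns an empty String array; otherwise it returns commandLine.getArgs(), the arguments left over once the generic options have been parsed.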
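The practical effect of that null check is that callers can always test the length of the result without guarding against null. A minimal sketch of the same pattern follows; RemainingArgsSketch and its parsedArgs field are hypothetical names standing in for the parser and for commandLine.getArgs(), not part of Hadoop.

```java
import java.util.Arrays;

// Hypothetical stand-in that mirrors only the null-guard in getRemainingArgs().
class RemainingArgsSketch {
    // Plays the role of commandLine.getArgs(); null models "no parsed command line".
    private final String[] parsedArgs;

    RemainingArgsSketch(String[] parsedArgs) {
        this.parsedArgs = parsedArgs;
    }

    // Same shape as the method quoted above: never returns null.
    String[] getRemainingArgs() {
        return (parsedArgs == null) ? new String[]{} : parsedArgs;
    }

    public static void main(String[] args) {
        String[] none = new RemainingArgsSketch(null).getRemainingArgs();
        // Safe even though nothing was parsed: length is 0, not a NullPointerException.
        System.out.println(none.length < 2);

        String[] some = new RemainingArgsSketch(new String[] {"in", "out"}).getRemainingArgs();
        System.out.println(Arrays.toString(some));
    }
}
```

This is why the check otherArgs.length < 2 in wordcount's main can be written without a separate null test.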

So what exactly does otherArgs hold?
Look at the following code:

for (int i = 0; i < otherArgs.length - 1; ++i) {
   FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

The loop passes each element of otherArgs to FileInputFormat.addInputPath, adding it as an input path, but it deliberately stops before the last element.
So what is the last element? The answer is in the setOutputPath call:
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
That is, the last element is treated as the output path.

From the code above we can conclude that in Hadoop 2.7.1, given
hadoop jar xx.jar pathA pathB pathC ... pathX
pathA pathB pathC ... are the input paths,
and pathX is the output path.
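The same rule can be written down without any Hadoop classes. The sketch below is only an illustration: PathArgsSketch, inputPaths and outputPath are hypothetical helper names, and plain strings stand in for Path objects.

```java
import java.util.Arrays;

// Hypothetical helper showing the "all but the last are inputs" rule.
class PathArgsSketch {
    // Everything except the final element is an input path.
    static String[] inputPaths(String[] otherArgs) {
        requireAtLeastTwo(otherArgs);
        return Arrays.copyOfRange(otherArgs, 0, otherArgs.length - 1);
    }

    // The final element is the output path.
    static String outputPath(String[] otherArgs) {
        requireAtLeastTwo(otherArgs);
        return otherArgs[otherArgs.length - 1];
    }

    // Mirrors the usage check in wordcount's main: at least one input plus one output.
    private static void requireAtLeastTwo(String[] otherArgs) {
        if (otherArgs.length < 2) {
            throw new IllegalArgumentException("Usage: wordcount <in> [<in>...] <out>");
        }
    }

    public static void main(String[] args) {
        String[] otherArgs = {"pathA", "pathB", "pathC", "pathX"};
        System.out.println("inputs: " + Arrays.toString(inputPaths(otherArgs)));
        System.out.println("output: " + outputPath(otherArgs));
    }
}
```

With the four paths from the command line above, the first three come back as inputs and pathX as the output.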

tang · posted on 2015-7-13 08:09:29
