hyj 发表于 2014-6-30 09:47:26

mapreduce 多种输入

问题导读:
1.如何多个路径,mapreduce如何实现?
2.多种输入,mapreduce如何实现?

static/image/hrline/4.gif




1.多路径输入

1)FileInputFormat.addInputPath 多次调用加载不同路径

FileInputFormat.addInputPath(job, new Path("hdfs://RS5-112:9000/cs/path1"));
FileInputFormat.addInputPath(job, new Path("hdfs://RS5-112:9000/cs/path2"));

2)FileInputFormat.addInputPaths一次调用加载 多路径字符串用逗号隔开

FileInputFormat.addInputPaths(job, "hdfs://RS5-112:9000/cs/path1,hdfs://RS5-112:9000/cs/path2");

2.多种输入

MultipleInputs可以加载不同路径的输入文件,并且每个路径可用不同的maper
MultipleInputs.addInputPath(job, new Path("hdfs://RS5-112:9000/cs/path1"), TextInputFormat.class,MultiTypeFileInput1Mapper.class);

MultipleInputs.addInputPath(job, new Path("hdfs://RS5-112:9000/cs/path3"), TextInputFormat.class,MultiTypeFileInput3Mapper.class);

例子:
package example;


import java.io.IOException;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
* 多类型文件输入
* @author lijl
*
*/


public class MultiTypeFileInputMR {
static class MultiTypeFileInput1Mapper extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key,Text value,Context context){
try {
String[] str = value.toString().split("\\|");
context.write(new Text(str), new Text(str));
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
static class MultiTypeFileInput3Mapper extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key,Text value,Context context){
try {
String[] str = value.toString().split("");
context.write(new Text(str), new Text(str));
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
static class MultiTypeFileInputReducer extends Reducer<Text, Text, Text, Text>{
public void reduce(Text key,Iterable<Text> values,Context context){
try {
for(Text value:values){
context.write(key,value);
}

} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
conf.set("mapred.textoutputformat.separator", ",");
Job job = new Job(conf,"MultiPathFileInput");
job.setJarByClass(MultiTypeFileInputMR.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://RS5-112:9000/cs/path6"));

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setReducerClass(MultiTypeFileInputReducer.class);
job.setNumReduceTasks(1);
MultipleInputs.addInputPath(job, new Path("hdfs://RS5-112:9000/cs/path1"), TextInputFormat.class,MultiTypeFileInput1Mapper.class);
MultipleInputs.addInputPath(job, new Path("hdfs://RS5-112:9000/cs/path3"), TextInputFormat.class,MultiTypeFileInput3Mapper.class);
System.exit(job.waitForCompletion(true)?0:1);
}


}








雨雪中的fish 发表于 2014-8-26 10:51:57

请问下楼主啊啊 FileInputFormat.addInputPath(job, new Path(args)); 后面的Path(args)是什么啊?

hyj 发表于 2014-8-27 11:44:38

雨雪中的fish 发表于 2014-8-26 10:51
请问下楼主啊啊 FileInputFormat.addInputPath(job, new Path(args)); 后面的Path(args)是什么啊?
java程序有一个主方法,是这样的public static void main(String [] args)
你说的args就是你用命令行编译运行java程序时,传入的第一个参数,比如你运行一个程序,代码如下:

public class Test{
    public static void main(String [] args){
      for(int i=0;i<args.length;i++)
            System.out.println(args);
    }
}


编译
javac Test.java
运行
java Test param1 param2 回车
你得到的结果是
param1
param2
也就是说args是你传入的第一个参数args是传入的第二个参数,以此类推。

程序猿的无奈 发表于 2016-6-3 16:16:41

楼主,你知道hadoop2.5怎么实现多个mapper输入吗

Kevin517 发表于 2016-11-12 14:35:03

请问楼主,我如果想对 .doc/ .pdf 的文件进行统计,该如何从 HDFS 上获取内容?

直接读的话,好像不行。。。

jiewuzhe02 发表于 2018-1-24 08:16:23

不错的
页: [1]
查看完整版本: mapreduce 多种输入