Questions addressed
1. How do you count categories with a MapReduce program?
2. Why do you need to declare the map output types explicitly?

There is a txt file whose contents look like this:
深圳文化衫订做 5729944
深圳厂家t恤批发 5729945
深圳定做文化衫 5729944
文化衫厂家 5729944
订做文化衫 5729944
深圳t恤厂家 5729945
The first field is a search keyword and the second is the ID of the category it belongs to, separated by a tab. I wanted to count how many keywords fall into each category, so I ran the following MapReduce program:
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class ClassCount extends Configured implements Tool
{
    public static class ClassMap
            extends Mapper<Text, Text, Text, IntWritable>
    {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String eachLine = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(eachLine, "\n");
            while (tokenizer.hasMoreTokens())
            {
                StringTokenizer token = new StringTokenizer(tokenizer.nextToken(), "\t");
                String keyword = token.nextToken(); // not used for this count
                String classId = token.nextToken();
                word.set(classId);
                context.write(word, one);
            }
        }
    }

    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException
        {
            int sum = 0;
            for (IntWritable val : values)
                sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception
    {
        Job job = new Job(getConf());
        job.setJarByClass(ClassCount.class);
        job.setJobName("classCount");

        job.setMapperClass(ClassMap.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int ret = ToolRunner.run(new ClassCount(), args);
        System.exit(ret);
    }
}
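For reference, a typical way to launch the job, assuming the class is packaged into a jar named ClassCount.jar and the input file is already on HDFS (jar name and paths here are illustrative):

hadoop jar ClassCount.jar ClassCount /user/hadoop/keywords.txt /user/hadoop/class_count_out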
Running it threw the following exception:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
Since the input lines are text, I assumed Text was the right key type, but that is not how it works: with TextInputFormat, the key passed to map is the byte offset of the line within the file, so the key type must be LongWritable.
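A minimal sketch of the corrected mapper (only the key type needs to change; the inner split on "\n" is also unnecessary, since TextInputFormat delivers one line per map call, so a single tab split suffices):

public static class ClassMap
        extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        // value holds one line of input: "<keyword>\t<classId>"
        String[] fields = value.toString().split("\t");
        word.set(fields[1]); // count by category ID
        context.write(word, one);
    }
}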
But after making that change, the job failed with the exception below:

14/04/25 17:21:15 INFO mapred.JobClient: Task Id : attempt_201404211802_0040_m_000000_1, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.IntWritable
This one is more self-explanatory. Because Java erases generic type parameters at runtime, the framework cannot infer the map output types from the Mapper's declaration; by default it assumes they match the job's final output types (LongWritable and Text), hence the mismatch. The fix is to declare the map output types explicitly by adding the following two lines to the run method:

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
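With both fixes in place the job completes, and for the six sample lines above the output is:

5729944	4
5729945	2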