Guiding questions
1. Which framework does this article use?
2. How does this IKAnalyzer approach differ from the earlier Chinese word-segmentation write-up?
The differences from that earlier approach are:
1) It used Hadoop Streaming; here the MapReduce framework is used directly.
2) The corpus here is The Legend of the Condor Heroes (《射雕英雄传》). Ha, a little variety never hurts.

0) Start from the WordCount source code and modify its Map so that IKAnalyzer does the tokenization inside the Map.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            // Decode the line via value.toString(): it is proper UTF-8 and, unlike
            // value.getBytes(), carries no stale bytes beyond getLength().
            Reader read = new StringReader(value.toString());
            // Let IKAnalyzer segment the line; true = smart (coarse-grained) mode.
            IKSegmenter iks = new IKSegmenter(read, true);
            Lexeme t;
            while ((t = iks.next()) != null) {
                // Emit each lexeme with a count of 1.
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
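Before moving to the cluster, it can help to confirm locally that the IKAnalyzer jar is on the classpath and segments text as expected. A minimal standalone sketch, outside MapReduce (the class name IKSegmenterTest and the sample sentence are placeholders, not part of the original post):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Hypothetical helper for a quick local check of IKAnalyzer segmentation.
public class IKSegmenterTest {
    public static void main(String[] args) throws IOException {
        // Any sample sentence will do; this one is made up for illustration.
        Reader reader = new StringReader("郭靖和黄蓉在桃花岛上练武");
        IKSegmenter iks = new IKSegmenter(reader, true); // true = smart (coarse-grained) mode
        Lexeme lexeme;
        while ((lexeme = iks.next()) != null) {
            System.out.println(lexeme.getLexemeText());
        }
    }
}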
1) So, done: the local plugin-simulated environment runs fine. Package the job (bundling the IKAnalyzer jar, e.g. inside the job jar's lib/ directory or via -libjars, which GenericOptionsParser supports) and ship it to the cluster.

hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0

...mapping reducing...

hadoop fs -ls ./out0
hadoop fs -get ./out0/part-r-00000 words.txt
2) Post-processing of the data.

2.1) Sorting. A plain sort -k2 orders the count column lexically, which is not what we want; -k2rn gives a reverse numeric sort on the counts.

head words.txt
tail words.txt

sort -k2 words.txt >0.txt
head 0.txt
tail 0.txt
sort -k2r words.txt >0.txt
head 0.txt
tail 0.txt
sort -k2rn words.txt >0.txt
head -n 50 0.txt
2.2) Extracting the targets: keep only entries whose word (the first field) is at least two characters long, dropping most single-character noise.

awk '{if(length($1)>=2) print $0}' 0.txt >1.txt
2.3) Presenting the results: take the top 50 and prepend rank numbers (sed = prints a line number before each line; the second sed joins the number onto the line).

head -n 50 1.txt | sed = | sed 'N;s/\n//'
1郭靖 6427
2黄蓉 4621
3欧阳 1660
4甚么 1430
5说道 1287
6洪七公 1225
7笑道 1214
8自己 1193
9一个 1160
10师父 1080
11黄药师 1059
12心中 1046
13两人 1016
14武功 950
15咱们 925
16一声 912
17只见 827
18他们 782
19心想 780
20周伯通 771
21功夫 758
22不知 755
23欧阳克 752
24听得 741
25丘处机 732
26当下 668
27爹爹 664
28只是 657
29知道 654
30这时 639
31之中 621
32梅超风 586
33身子 552
34都是 540
35不是 534
36如此 531
37柯镇恶 528
38到了 523
39不敢 522
40裘千仞 521
41杨康 520
42你们 509
43这一 495
44却是 478
45众人 476
46二人 475
47铁木真 469
48怎么 464
49左手 452
50地下 448
Among the words that are not character names there are many interesting ones, e.g. #5 说道 (said), #7 笑道 (said with a laugh), #12 心中 (in one's heart), #17 只见 (saw that), #22 不知 (did not know), #30 这时 (at this moment), and #49 左手 (left hand).