How to convert a traditional program into MapReduce

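Guiding questions:
1. What characterizes a traditional program?
2. How does MapReduce achieve distributed processing?
3. Can every traditional program be converted to MapReduce?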

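Computing an average or sorting in a traditional program is not hard for us programmers. How, then, do we convert such a program to MapReduce? This post uses computing an average as the example.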

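I. The traditional program:
Whether in C, Java, .NET, or any other language, take computing an average as an example.
We start with a set of input data: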

1
2
3
4
5

Then we compute the average:

(1 + 2 + 3 + 4 + 5) / 5 = 3

In any language, this is trivial.
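For reference, the traditional version might look like the following in Java. This is a minimal sketch; the class name `TraditionalAverage` and the method `average` are names chosen for illustration and do not come from the original post.

```java
import java.util.Arrays;
import java.util.List;

class TraditionalAverage {
    // Sum every value in one pass on a single machine, then divide by the count.
    static double average(List<Integer> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("cannot average an empty list");
        }
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    public static void main(String[] args) {
        System.out.println(average(Arrays.asList(1, 2, 3, 4, 5))); // prints 3.0
    }
}
```

Everything happens in one loop on one machine; there is no step that divides the input.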

II. Converting it to a MapReduce program

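MapReduce is called a distributed programming model because it splits the input data, has each node process one part of it, and then merges the results. To compute the average, MapReduce first splits the input data: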
1
2
3
4
5
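After splitting, the pieces are sent to map for processing, and the map output is passed on to reduce; that completes the MapReduce job. This intermediate splitting step is what a traditional program lacks. Below, the average is computed with a MapReduce implementation.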
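The split, map, reduce flow just described can be imitated in plain Java without a cluster. The sketch below is only an in-memory illustration under stated assumptions: the class `MiniMapReduceAverage`, its methods, and the way the five numbers are grouped into three splits are all made up for this example, and real Hadoop would run each split on a separate task rather than in a loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduceAverage {
    // "Map" step: turn one input line into a (key, value) pair.
    // Every number is emitted under the same key so one reduce call sees them all.
    static void map(String line, Map<String, List<Long>> shuffled) {
        long value = Long.parseLong(line.trim());
        shuffled.computeIfAbsent("avg", k -> new ArrayList<>()).add(value);
    }

    // "Reduce" step: receives every value that shares a key and combines them.
    static double reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    static double run(List<List<String>> splits) {
        Map<String, List<Long>> shuffled = new TreeMap<>();
        // On a cluster, each split would be handled by a different map task.
        for (List<String> split : splits) {
            for (String line : split) {
                map(line, shuffled);
            }
        }
        return reduce(shuffled.get("avg"));
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("1", "2"),
                Arrays.asList("3", "4"),
                Arrays.asList("5"));
        System.out.println(run(splits)); // prints 3.0
    }
}
```

The result matches the traditional computation; the only new ingredient is that the input arrives as separate pieces.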

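First comes the map function:
The map function processes the split data one record at a time; the splitting itself has already been done by the framework before map is called.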

Let's analyze MapReduce using the result below:

[Image: 输出内容.png, the job's counter output]

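In the result above, the counter group "Map输出传递Value" (values passed to map) shows that the map function was called 5 times, and each call emitted strScore.
Reduce was called once.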
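Why five map calls but only one reduce call? With line-based input, map runs once per input record, while reduce runs once per distinct key, and here every record is emitted under the same key. The plain-Java sketch below only counts those calls; `CallCounts` and `count` are illustrative names, not Hadoop APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class CallCounts {
    // Returns {mapCalls, reduceCalls} for a job whose mapper emits keyOf(line).
    static int[] count(List<String> lines, Function<String, String> keyOf) {
        int mapCalls = 0;
        Set<String> distinctKeys = new HashSet<>();
        for (String line : lines) {
            mapCalls++; // map runs once per input record
            distinctKeys.add(keyOf.apply(line));
        }
        // reduce runs once per distinct key
        return new int[] {mapCalls, distinctKeys.size()};
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1", "2", "3", "4", "5");

        int[] sameKey = count(lines, line -> "avg");
        System.out.println(sameKey[0] + " map calls, " + sameKey[1] + " reduce call(s)");
        // prints "5 map calls, 1 reduce call(s)"

        int[] ownKey = count(lines, line -> line);
        System.out.println(ownKey[0] + " map calls, " + ownKey[1] + " reduce call(s)");
        // prints "5 map calls, 5 reduce call(s)"
    }
}
```

The second case shows that the number of reduce calls depends on the keys the mapper chooses, not on the number of input lines.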

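The program is attached below. To add it to your own project, you first need to:
(1) create the package aboutyun.com
(2) have an avg.txt file
(3) change the paths to your own HDFS paths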

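From the above we can see that a traditional program can be converted to MapReduce, provided its work can be expressed as independent map steps followed by a reduce step that combines their output.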

package aboutyun.com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.util.Iterator;
import java.util.StringTokenizer;

public class pingjunzhi {

        static final String INPUT_PATH = "hdfs://master:8020/avg.txt";
        static final String OUT_PATH = "hdfs://master:8020/outPut/test";

        public static void main(String[] args) throws Exception {
                // Driver: configures and submits the job
                Configuration conf = new Configuration();

                final Job job = Job.getInstance(conf, pingjunzhi.class.getSimpleName());
                job.setNumReduceTasks(1);
                job.setJarByClass(pingjunzhi.class);
                // Where to read the input from
                FileInputFormat.setInputPaths(job, INPUT_PATH);
                // Class that parses the input; TextInputFormat passes one line per map call
                job.setInputFormatClass(TextInputFormat.class);
                job.setMapperClass(MyMapper.class);

                // <key, value> types emitted by map
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(LongWritable.class);

                job.setReducerClass(MyReduce.class);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);
                // Where to write the output
                FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
                // Class that formats the output
                job.setOutputFormatClass(TextOutputFormat.class);

                // Submit the job, wait for it to finish, and report failure via the exit code
                System.exit(job.waitForCompletion(true) ? 0 : 1);

        }

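        // Mapper class
        static class MyMapper extends
                        Mapper<LongWritable, Text, Text, LongWritable> {
                @Override
                protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {

                        String line = value.toString();
                        // Counter group "Map输出传递Value" (value passed to map): one counter per
                        // input line. Demo only; unsuitable for large inputs.
                        Counter countPrint = context.getCounter("Map输出传递Value", line);
                        countPrint.increment(1L);
                        // Split the input by line. With TextInputFormat each call already
                        // receives a single line, so this loop runs at most once.
                        StringTokenizer tokenizerArticle = new StringTokenizer(line, "\n");

                        // Process each line
                        while (tokenizerArticle.hasMoreElements()) {

                                // Split the line on whitespace
                                StringTokenizer tokenizerLine = new StringTokenizer(
                                                tokenizerArticle.nextToken());
                                if (!tokenizerLine.hasMoreTokens()) {
                                        continue; // skip whitespace-only lines
                                }

                                String strScore = tokenizerLine.nextToken();// the number
                                // Counter group "Map中循环strScore" (strScore seen in the map loop)
                                Counter countPrint1 = context.getCounter("Map中循环strScore", strScore);
                                countPrint1.increment(1L);

                                int scoreInt = Integer.parseInt(strScore);

                                // Emit every number under the same key so one reduce call sees them all
                                context.write(new Text("avg"), new LongWritable(scoreInt));

                        }

                }

        }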

        // Reducer class
        static class MyReduce extends
                        Reducer<Text, LongWritable, Text, LongWritable> {
                @Override
                protected void reduce(Text k2, java.lang.Iterable<LongWritable> v2s,
                                Context ctx) throws java.io.IOException, InterruptedException {

                        long sum = 0;
                        long count = 0;

                        Iterator<LongWritable> iterator = v2s.iterator();
                        while (iterator.hasNext()) {
                                sum += iterator.next().get();// running total
                                count++;// number of values
                        }

                        // Integer division: any fractional part of the average is truncated
                        long average = sum / count;

                        ctx.write(k2, new LongWritable(average));
                        // Counter group "Redue调用次数" (number of reduce calls), counter "空" (empty)
                        Counter countPrint1 = ctx.getCounter("Redue调用次数","空");
                        countPrint1.increment(1L);

                }

        }

}
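One caveat about averages in particular: the listing works because a single reduce call sees every raw value. If partial results were ever computed per split first (for example in a combiner), averaging those averages would give the wrong answer unless the splits happened to be the same size. A common way around this is to carry (sum, count) pairs and divide only at the end. The sketch below shows the idea in plain Java; `PartialAverage` is a hypothetical helper, not part of the posted program or of Hadoop. Note also that the posted reducer stores the result in a LongWritable, so any fractional part is dropped.

```java
class PartialAverage {
    final long sum;
    final long count;

    PartialAverage(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Summarize one split as (sum, count) instead of as a finished average.
    static PartialAverage of(long... values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return new PartialAverage(sum, values.length);
    }

    // Two partial results combine exactly, whatever the split sizes were.
    PartialAverage merge(PartialAverage other) {
        return new PartialAverage(sum + other.sum, count + other.count);
    }

    double average() {
        return (double) sum / count;
    }

    public static void main(String[] args) {
        PartialAverage a = of(1, 2);
        PartialAverage b = of(3, 4, 5);
        System.out.println((a.average() + b.average()) / 2); // prints 2.75, not the true average
        System.out.println(a.merge(b).average());            // prints 3.0
    }
}
```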

kartik · posted 2014-7-4 10:36:11

How would a traditional file-download program be converted to MapReduce? Are there any related examples? Thanks, moderator.

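pig2 · posted 2014-7-4 11:05:31

Last edited by pig2 on 2014-7-4 11:23

kartik posted on 2014-7-4 10:36
How would a traditional file-download program be converted to MapReduce? Are there any related examples? Thanks, moderator.

Is this a batch download?

If so, you can split the batch of data in map, send the pieces to different partitions, and finally merge the results in reduce.

This is only an example. Say you need to download the following URLs: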

www.aboutyun.com
www.aboutyun.com/forum.php
http://www.aboutyun.com/group.php
......

I haven't done this the optimal way here; you can adjust it further:

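// Mapper class
static class MyMapper extends
                Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {

                String line = value.toString();
                // Counter group "Map输出传递url" (URL line passed to map)
                Counter countPrint = context.getCounter("Map输出传递url", line);
                countPrint.increment(1L);
                // Split the input by line first
                StringTokenizer tokenizerArticle = new StringTokenizer(line, "\n");

                // Process each line
                while (tokenizerArticle.hasMoreElements()) {

                        // Split the line on whitespace and take the first token as the URL
                        StringTokenizer tokenizerLine = new StringTokenizer(
                                        tokenizerArticle.nextToken());
                        if (!tokenizerLine.hasMoreTokens()) {
                                continue; // skip whitespace-only lines
                        }
                        String url = tokenizerLine.nextToken();
                        // Counter group "Map中循环url" (URL seen in the map loop)
                        Counter countPrint1 = context.getCounter("Map中循环url", url);
                        countPrint1.increment(1L);

                        // Emit the URL as the key so different URLs can go to different partitions
                        context.write(new Text(url), new LongWritable(1L));

                }
        }
}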

With that, the URLs have been split up, which gives us the distribution we want; a traditional program has no such splitting step.

Below is a reduce example. It is only a sketch of the idea, and you still need to fill in the real download logic yourself.

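// Reducer class
        static class MyReduce extends
                        Reducer<Text, LongWritable, Text, LongWritable> {
                @Override
                protected void reduce(Text k2, java.lang.Iterable<LongWritable> urls,
                                Context ctx) throws java.io.IOException, InterruptedException {

                        // Assumes the mapper emits each URL as the key, so k2 is the URL and
                        // "urls" holds one entry per occurrence of that URL in the input.
                        String url = k2.toString();

                        // The download business logic goes in download()
                        if (download(url)) {
                                ctx.write(new Text(url + " succeeded"), new LongWritable(1L));
                        } else {
                                ctx.write(new Text(url + " failed"), new LongWritable(1L));
                        }
                }

                // Hypothetical placeholder, not part of the original post: replace the
                // body with the real download logic and return whether it succeeded.
                private boolean download(String url) {
                        try (java.io.InputStream in = new java.net.URL(url).openStream()) {
                                byte[] buffer = new byte[8192];
                                while (in.read(buffer) != -1) {
                                        // read and discard; a real job would store the bytes
                                }
                                return true;
                        } catch (java.io.IOException e) {
                                return false;
                        }
                }
        }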
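To make "send the pieces to different partitions" concrete, here is a plain-Java sketch of how keys can be assigned to partitions. It uses the same formula as Hadoop's default HashPartitioner (hash code with the sign bit masked off, modulo the number of partitions), but applies it to `String.hashCode()`, so the actual assignments would differ from what a real job with Text keys produces. `UrlPartitioner` and its methods are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UrlPartitioner {
    // Mask off the sign bit so the result is never negative, then take the remainder.
    static int partitionFor(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Group the URLs into numPartitions buckets; each bucket could go to its own reducer.
    static List<List<String>> partition(List<String> urls, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String url : urls) {
            buckets.get(partitionFor(url, numPartitions)).add(url);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "www.aboutyun.com",
                "www.aboutyun.com/forum.php",
                "http://www.aboutyun.com/group.php");
        List<List<String>> buckets = partition(urls, 2);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("partition " + i + ": " + buckets.get(i));
        }
    }
}
```

The same URL always lands in the same partition, which is what lets each reducer work on its own share independently. For more than one partition to be used in the posted driver, `setNumReduceTasks` would need a value greater than 1.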

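pig2 · posted 2014-7-4 11:18:41

The driver code is much the same and needs no changes. The key point is that map only does the splitting and does no real business-logic work; the business logic can go in reduce.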

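kartik · posted 2014-7-4 11:38:22

pig2 posted on 2014-7-4 11:18
The driver code is much the same and needs no changes. The key point is that map only does the splitting and does no real business-logic work; the business ...

OK, I'll look into it first. Thanks.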

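kartik · posted 2014-7-4 11:43:14

pig2 posted on 2014-7-4 11:18
The driver code is much the same and needs no changes. The key point is that map only does the splitting and does no real business-logic work; the business ...

My case is downloading the data behind a single URL, which involves multithreading and resumable transfers. Could one map correspond to one data block?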

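pig2 · posted 2014-7-4 11:59:53

kartik posted on 2014-7-4 11:43
My case is downloading the data behind a single URL, which involves multithreading and resumable transfers. Could one map correspond to one data block?

All you need to understand is that map is where the work gets divided up. It is like doing a job: we used to do everything on one machine, which is the traditional program. Now that several machines are available, we can put several of them to work.
So how you divide up the download is the key.

Here you could divide by data block or by thread. This is only a line of thinking; you will need to investigate its feasibility further.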
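As one possible way to "divide by data block", the sketch below cuts a file of known size into consecutive byte ranges in the inclusive `start-end` form used by the HTTP `Range: bytes=start-end` header. It assumes the server supports range requests and that the total size is known in advance; `ByteRanges` and `split` are hypothetical names. Each range could then be written as one line of the job's input, so that each map call handles one block (Hadoop's NLineInputFormat can be used to control how many lines each map task receives).

```java
import java.util.ArrayList;
import java.util.List;

class ByteRanges {
    // Splits [0, totalBytes) into consecutive chunks of at most chunkBytes each.
    // Each entry is "start-end" with an inclusive end offset.
    static List<String> split(long totalBytes, long chunkBytes) {
        if (totalBytes < 0 || chunkBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0 and chunkBytes > 0");
        }
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < totalBytes; start += chunkBytes) {
            long end = Math.min(start + chunkBytes, totalBytes) - 1;
            ranges.add(start + "-" + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(split(10, 4)); // prints [0-3, 4-7, 8-9]
    }
}
```

A block that fails can be retried on its own, which is the same property that resumable downloads rely on.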
