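Hadoop DistributedCache: Usage and Internals

Last edited by desehawk on 2014-6-28 12:45

Guiding questions:

1. What is DistributedCache?
2. What is true of every HDFS file once it has been placed in the cache?
3. What types of files can DistributedCache distribute?
4. Which settings does DistributedCache use to distribute files, and how do you distribute several files?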

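Overview

DistributedCache is a facility provided by the Map/Reduce framework for caching files (text, archives, jars and so on). The default access scheme for these files is hdfs://.

DistributedCache copies the cached files to a slave node before any task of the job runs on that node.

Each file is copied only once per job, and cached archives are unpacked on the slave nodes.

Symbolic links

Every file stored in HDFS can be accessed through a symbolic link once it has been placed in the cache.

With the URI hdfs://namenode/test/input/file1#myfile, your program can access file1 simply as myfile; myfile is a symbolic link to the cached copy.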
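
To make the link-name rule concrete, here is a minimal sketch that uses only java.net.URI from the JDK (no Hadoop classes) to pull the fragment out of a cache URI. The class and method names are made up for illustration; the URI is the one from the text above.

```java
import java.net.URI;

class CacheLinkName {
    // Returns the part after '#' in a cache URI, or null when there is none.
    static String linkName(String cacheUri) {
        return URI.create(cacheUri).getFragment();
    }

    public static void main(String[] args) {
        System.out.println(linkName("hdfs://namenode/test/input/file1#myfile"));
    }
}
```

Running main prints myfile, which is the name a task would use to open the cached file.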

Local storage directory for the cache

<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>
<property>
  <name>local.cache.size</name>
  <value>10737418240</value> <!-- default size: 10 GB -->
  <description>The limit on the size of cache you want to keep, set by default
  to 10GB. This will act as a soft limit on the cache directory for out of band data.
  </description>
</property>

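Actual storage directory on a DataNode:

/netqin/hadoop/tmp/mapred/local/taskTracker/archive/hadoop-server01
(here /netqin/hadoop/tmp is the value of ${hadoop.tmp.dir} and hadoop-server01 is the NameNode host name)

Archive files are unpacked in this directory.

Example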

package com.netqin.examples;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class CacheDemo {
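    // Reads the cached file through its symlink in the task's working directory.
    public static void UseDistributedCacheBySymbolicLink() throws Exception {
        // "mail.py" is the link name given after '#' in the cache URI.
        // FileReader only opens local files, so no hdfs:// prefix is used here.
        FileReader reader = new FileReader("mail.py");
        BufferedReader br = new BufferedReader(reader);
        String s = null;
        while ((s = br.readLine()) != null) {
            System.out.println(s);
        }
        br.close();
        reader.close();
    }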
    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        protected void setup(Context context) throws IOException,
                InterruptedException {
            System.out.println("Now, use the distributed cache and symlink");
            try {
                UseDistributedCacheBySymbolicLink();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
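        // Ask the framework to create symlinks in each task's working directory.
        DistributedCache.createSymlink(conf);
        // HDFS path of the file to cache; the part after '#' is the symlink name.
        String path = "/tmp/test/mail.py";
        Path filePath = new Path(path);
        String uriWithLink = filePath.toUri().toString() + "#" + "mail.py";
        // Register the file before the Job is created, because Job copies conf.
        DistributedCache.addCacheFile(new URI(uriWithLink), conf);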
      
        // Path p = new Path("/tmp/hadoop-0.20.2-capacity-scheduler.jar#hadoop-0.20.2-capacity-scheduler.jar");
        // DistributedCache.addArchiveToClassPath(p, conf);
      
      
        Job job = new Job(conf, "CacheDemo");
        job.setJarByClass(CacheDemo.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

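DistributedCache

DistributedCache distributes application-specific, large, read-only files efficiently.

DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications.

Applications specify the files to be cached via URLs (hdfs://) in the JobConf. DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem.

The Map/Reduce framework copies the necessary files to a slave node before any task of the job runs on that node. It is efficient because the files are copied only once per job and because archives can be cached and are unpacked on the slave nodes.

DistributedCache tracks the modification timestamps of the cached files. The cache files must not be modified by the application or by external programs while the job is running.

DistributedCache can distribute simple read-only data or text files as well as more complex types such as archives and jar files. Archives (zip, tar, tgz and tar.gz files) are un-archived on the slave nodes. The files have execute permissions set.

Files can be distributed by setting the property mapred.cache.{files|archives}. To distribute several files, separate their paths with commas. The properties can also be set through the API: DistributedCache.addCacheFile(URI,conf)/ DistributedCache.addCacheArchive(URI,conf) and DistributedCache.setCacheFiles(URIs,conf)/ DistributedCache.setCacheArchives(URIs,conf).

The URI has the form hdfs://host:port/absolute-path#link-name. In Streaming programs, files can be distributed with the command-line options -cacheFile/-cacheArchive.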
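
As a rough illustration of the comma-separated format, the following sketch builds the string that would go into mapred.cache.files from several URIs. It uses only the JDK; the host namenode:9000, the file paths and the class name are hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class CacheFilesProperty {
    // Joins the cache URIs with commas, one entry per file to distribute.
    static String toPropertyValue(List<URI> uris) {
        return uris.stream().map(URI::toString).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Hypothetical files; each URI carries its own #link-name.
        List<URI> uris = Arrays.asList(
                URI.create("hdfs://namenode:9000/data/stopwords.txt#stopwords"),
                URI.create("hdfs://namenode:9000/data/cities.txt#cities"));
        System.out.println("mapred.cache.files=" + toPropertyValue(uris));
    }
}
```

In a real job the same effect is normally obtained by calling DistributedCache.addCacheFile once per file, which avoids assembling the string by hand.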

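Users can have DistributedCache create symbolic links to the cached files in the task's current working directory by calling DistributedCache.createSymlink(Configuration), or by setting the configuration property mapred.create.symlink to yes. DistributedCache uses the fragment of the URI as the name of the link. For example, with the URI hdfs://namenode:port/lib.so.1#lib.so, the task's current working directory contains a link named lib.so that points to the file lib.so.1 in the distributed cache.

DistributedCache can also serve as a rudimentary software distribution mechanism for map/reduce tasks. It can be used to distribute jars and native libraries. The DistributedCache.addArchiveToClassPath(Path, Configuration) and DistributedCache.addFileToClassPath(Path, Configuration) APIs cache files and jars and add them to the classpath of the child JVM. The same can be achieved by setting the configuration properties mapred.job.classpath.{files|archives}. Cached files can also be used to distribute and load native libraries.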

http://www.open-open.com/lib/view/open1337349822015.html

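Hadoop has a mechanism called the distributed cache for distributing data to all nodes in the cluster. To save network bandwidth, each file is normally copied to any given node only once per job.
The location that cached files are copied to is set in mapred-site.xml: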

<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/tmp</value>
</property>

Steps:
1. Distribute the data to every node:
DistributedCache.addCacheFile(new URI("hdfs://cloud01:9000/user/hadoop/mrinput/ST.txt"), conf);
Note: this call must be made before the Job is created and conf is passed to it; otherwise the Mapper will not see the path of the data file.
2. In each Mapper, get the file URIs and then work with the file:
URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());

For example, to read the file:
FileSystem fs = FileSystem.get(URI.create("hdfs://cloud01:9000"), context.getConfiguration());
FSDataInputStream in = fs.open(new Path(uris[0].getPath()));
BufferedReader br = new BufferedReader(new InputStreamReader(in));

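DistributedCache in Hadoop, part 2
Hadoop's distributed cache mechanism lets all map or reduce tasks of a job access the same file. After the job is submitted, Hadoop copies the files specified with the -files and -archives options to HDFS (the JobTracker's file system). Before a task runs, the TaskTracker copies the files from the JobTracker's file system to local disk as a cache, so that the task can access them. The job itself does not care where the files came from.

When using DistributedCache, localized files are usually accessed through a symbolic link, which is more convenient. A file specified by the URI hdfs://namenode/test/input/file1#myfile is symlinked as myfile in the current working directory, so the job can access the file directly as myfile without caring about its actual local path.
An example, WordCount.java, follows: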

package org.myorg;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount
{
    // Reads the cached file through its symlink "god.txt" in the task's working directory.
    public static void UseDistributedCacheBySymbolicLink() throws Exception
    {
        FileReader reader = new FileReader("god.txt");
        BufferedReader br = new BufferedReader(reader);
        String s1 = null;
        while ((s1 = br.readLine()) != null)
        {
            System.out.println(s1);
        }
        br.close();
        reader.close();
    }


    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
    {

        // Called once per task before any map() call; prints the cached file.
        public void configure(JobConf job)
        {
            System.out.println("Now, use the distributed cache and symlink");
            try {
                UseDistributedCacheBySymbolicLink();
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }

        }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token of the input line.
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
        {
            int sum = 0;
            while (values.hasNext())
            {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Cache the HDFS file and expose it to tasks under the link name god.txt.
        DistributedCache.createSymlink(conf);
        String path = "/xuxm_dev_test_61_pic/in/WordCount.java";
        Path filePath = new Path(path);
        String uriWithLink = filePath.toUri().toString() + "#" + "god.txt";
        DistributedCache.addCacheFile(new URI(uriWithLink), conf);

        JobClient.runJob(conf);
    }
}

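For how to run it, see http://hadoop.apache.org/common/ ... BC%9AWordCount+v1.0

When the program runs, the contents of /xuxm_dev_test_61_pic/in/WordCount.java are printed in the task logs, which can be viewed through the JobTracker.

If a program needs many small files, symbolic links are very convenient.

Before running, put WordCount.java at the specified location; otherwise the file will not be found.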

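Concept:
The reduce-side join technique is flexible, but it can sometimes be extremely inefficient. Because the join does not start until the reduce() phase, all the data is shuffled across the network, and in most cases most of that data is discarded during the join. We would therefore like to complete the whole join in the map phase.

Main technical difficulty:

The main obstacle to joining in the map phase is that a mapper may need to join against data it cannot access. If we can guarantee that such data is available to the mapper, the technique works. For example, if we know that the two data sources are divided into the same number of partitions, and every partition is sorted by a key suitable for use as the join key, then each mapper can obtain all the data the join needs. In fact, Hadoop's org.apache.hadoop.mapred.join package contains helper classes that implement this kind of map-side join. Unfortunately this situation is rare, and using these classes adds overhead, so we will not discuss the package further.

When should it be used?

Case 1: we know that the two data sources are divided into the same number of partitions, and every partition is sorted by a key suitable for use as the join key.

Case 2: when joining large data sets, usually only one source is huge and the other may be orders of magnitude smaller. For example, a phone company may have only tens of millions of customer records, while its transaction data may contain a billion or more individual call records. When the smaller source fits into a mapper's memory, we can get a significant performance gain by copying the smaller source to every mapper machine and letting the mapper perform the join in the map phase. This is called a replicated join.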
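
The core of a replicated join can be sketched without any Hadoop classes: load the small source into a hash map once, then look up each record of the large source as it streams past. The record layouts ("customerId,name" and "customerId,callDetail"), the sample data and all names below are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicatedJoinSketch {
    // Builds the in-memory lookup table from the small source ("customerId,name").
    static Map<String, String> loadSmallSide(List<String> customerLines) {
        Map<String, String> byId = new HashMap<>();
        for (String line : customerLines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                byId.put(parts[0], parts[1]);
            }
        }
        return byId;
    }

    // Joins one record of the large source ("customerId,callDetail").
    // Returns null when the customer is unknown, i.e. an inner join.
    static String joinRecord(Map<String, String> customers, String callLine) {
        String[] parts = callLine.split(",", 2);
        if (parts.length < 2) {
            return null;
        }
        String name = customers.get(parts[0]);
        return name == null ? null : parts[0] + "," + name + "," + parts[1];
    }

    public static void main(String[] args) {
        Map<String, String> customers = loadSmallSide(Arrays.asList("1,Alice", "2,Bob"));
        List<String> calls = Arrays.asList(
                "1,2014-06-01 10:00", "3,2014-06-01 10:05", "2,2014-06-01 10:07");
        for (String call : calls) {
            String joined = joinRecord(customers, call);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

In a mapper, loadSmallSide would correspond to work done once in setup() or configure(), and joinRecord to the body of map(). The call for customer 3 is dropped because that id is not in the small table.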

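Solution:

Hadoop has a mechanism called the distributed cache for distributing data to all nodes in the cluster. It is typically used to distribute files containing "background" data that all mappers need. For example, if you use Hadoop to classify documents, you might have a list of keywords, and you would use the distributed cache to make sure every mapper can access these keywords (the "background data").
Steps:

1. Distribute the data to every node:

DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);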

2. On each mapper, call DistributedCache.getLocalCacheFiles() to get the local paths of the files, then process them:

Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
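
Once a task has a local path, reading the file is ordinary Java I/O. The sketch below is JDK-only and assumes a tab-separated "key<TAB>value" layout, which the original post does not specify; the class name and the sample data are also hypothetical. Note that Path here is java.nio.file.Path, not Hadoop's org.apache.hadoop.fs.Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class LocalCacheLoader {
    // Reads "key<TAB>value" lines from a local file into a map.
    static Map<String, String> load(String localPath) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a tab separator
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a localized cache file; a real task would take the path
        // from DistributedCache.getLocalCacheFiles(conf) instead.
        Path tmp = Files.createTempFile("cache-demo", ".txt");
        Files.write(tmp, Arrays.asList("415\tSan Francisco", "212\tNew York"));
        System.out.println(load(tmp.toString()).get("415"));
        Files.delete(tmp);
    }
}
```

Loading the table once per task, rather than once per record, is what keeps the map-side lookup cheap.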

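A new problem:

Another restriction is that one of the joined tables must be small enough to fit in memory. Even when the inputs differ greatly in size, the smaller one may still be too large to hold in memory.
1. We can rearrange the processing steps to make the join workable. For example, if you need the order data for all customers in the 415 area code, joining the Orders and Customers tables before filtering out any records is correct but inefficient, because both Customers and Orders may be too large to fit in memory. Instead we can preprocess the data so that Customers or Orders becomes smaller.
2. Sometimes no amount of preprocessing makes the data small enough. In that case we should filter out customers who are not in the 415 area during the map phase. See Hadoop in Action, Chapter 5.2.3, semijoin.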
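
The filtering idea in the two points above can be sketched in plain Java: shrink the Customers side to the 415 area code before it is used for a join. The record layout "customerId,name,phone" with phone numbers written as 415-555-0100 is an assumption for illustration, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AreaCodeFilter {
    // Keeps only "customerId,name,phone" records whose phone starts with areaCode.
    static List<String> keepAreaCode(List<String> customerLines, String areaCode) {
        List<String> kept = new ArrayList<>();
        for (String line : customerLines) {
            String[] fields = line.split(",");
            if (fields.length == 3 && fields[2].startsWith(areaCode + "-")) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
                "1,Alice,415-555-0100",
                "2,Bob,212-555-0101",
                "3,Carol,415-555-0102");
        for (String line : keepAreaCode(customers, "415")) {
            System.out.println(line);
        }
    }
}
```

Only the records for customers 1 and 3 survive, so the table that has to fit in memory is smaller than the full Customers table.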
