Several issues to watch out for in Hadoop development

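Guiding questions:
1. How do you set the character encoding in Hadoop development, and how many ways do you know?
2. Why compress data in MapReduce?
3. How many reducers should a job use?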

Hadoop keeps being upgraded, but we still run into the following problems from time to time.

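1 Chinese text
Chinese text is parsed out of a URL, but Hadoop still prints it as garbage? We used to believe Hadoop did not support Chinese. After reading the source code we found that Hadoop simply does not support writing Chinese in GBK.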

The code below is from TextOutputFormat.class. Hadoop's default output formats all derive from FileOutputFormat, which has two subclasses: one writes a binary stream, and the other, TextOutputFormat, writes text.

public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
      implements RecordWriter<K, V> {
    private static final String utf8 = "UTF-8"; // hard-coded to UTF-8
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    …
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    …
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        // Text already holds UTF-8 bytes; this branch must change too
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
…
}
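As the code shows, Hadoop's default text output is hard-coded to UTF-8. So if the Chinese text was decoded correctly, setting the character set of the Linux client to UTF-8 is enough to see it, because Hadoop wrote it as UTF-8.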
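To make the mismatch concrete, here is a minimal sketch using only the JDK (no Hadoop classes). The class name EncodingDemo and the method misreadAsGbk are made up for illustration, and the sketch assumes the JDK in use ships the GBK charset. It shows that the same two Chinese characters take a different number of bytes in each encoding, and that reading UTF-8 bytes as GBK yields different text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Hadoop.
class EncodingDemo {
    // Assumes the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Decode bytes that are really UTF-8 as if they were GBK,
    // which is what a GBK terminal or database column would do.
    static String misreadAsGbk(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, GBK);
    }

    public static void main(String[] args) {
        // The word for "Chinese", written as escapes so the source file
        // encoding does not affect the demo.
        String word = "\u4e2d\u6587";
        System.out.println("UTF-8 byte count: " + word.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("GBK byte count:   " + word.getBytes(GBK).length);
        System.out.println("misread as GBK equals original: " + word.equals(misreadAsGbk(word)));
    }
}
```

The point is only that the reader and the writer must agree on one encoding; which one is a deployment choice.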
Most of our databases define their fields as GBK, so what if we want Hadoop to write Chinese as GBK to stay compatible with them?
We can define a new class:
public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
      implements RecordWriter<K, V> {
    // just switch the encoding to GBK
    private static final String gbk = "gbk";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    …
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    …
    private void writeObject(Object o) throws IOException {
      // Text stores UTF-8 bytes, so do not write to.getBytes() directly.
      // Always go through toString() and re-encode as GBK, for Text and
      // non-Text objects alike.
      out.write(o.toString().getBytes(gbk));
    }
…
}
Then add conf1.setOutputFormat(GbkOutputFormat.class) to the MapReduce code,
and the job writes Chinese as GBK.
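The reason the Text branch of writeObject has to change is that Text keeps its content as UTF-8 bytes, and its backing array is only valid up to getLength(). A minimal JDK-only sketch of the re-encoding step follows. The class GbkTranscode and the method utf8ToGbk are hypothetical names, the byte array stands in for what Text.getBytes() would return, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; it only mimics what the GBK writer has to do.
class GbkTranscode {
    // Assumes the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // buf plays the role of Text.getBytes(): UTF-8 content in the first
    // `length` bytes, possibly followed by unused capacity.
    static byte[] utf8ToGbk(byte[] buf, int length) {
        String decoded = new String(buf, 0, length, StandardCharsets.UTF_8);
        return decoded.getBytes(GBK);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // Simulate a backing array that is longer than the valid content.
        byte[] padded = Arrays.copyOf(utf8, utf8.length + 8);
        byte[] gbk = utf8ToGbk(padded, utf8.length);
        System.out.println("GBK byte count: " + gbk.length);
        System.out.println("decodes back correctly: "
                + new String(gbk, GBK).equals("\u4e2d\u6587"));
    }
}
```

Calling o.toString() on a Text, as the class above does, performs the same decode step before the GBK encode.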

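Note: in later versions this problem shows up less often. Still, pay attention to encoding while programming, and keep the encoding consistent across your Hadoop code, preferably UTF-8.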

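2 Compression during computation and its effect on efficiency
An earlier post explained that compressing the input files can speed up some jobs. Here is a more detailed look.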

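Why does compression speed up computation? A MapReduce job spreads and copies data across the datanodes, and compression reduces the time spent moving that data over the network. When the time saved exceeds the time spent compressing and decompressing, the job runs faster.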

Besides compressing the input files, Hadoop can also compress the map output and the reduce output while the job runs. What effect does this in-job compression have?

Test environment: a 35-node Hadoop cluster; each machine has 2 CPUs, 8 cores, and 8 GB of memory, running redhat 2.6.9. One node is the namenode and one is the secondary namenode; neither of them acts as a datanode.

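The input file is 2.5 GB uncompressed, about 36 million records. The MapReduce program consists of two jobs:
job1: map splits the records using the user field as the key, and reduce performs an outer join. The final reduce output is 8.7 billion records, 540 GB.
job2: map reads those 8.7 billion records and emits them, and reduce computes simple statistics. The final output is 250 million records, 16 GB.
Total running time: 54 min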

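Compressing only the map output of the second job (the map output of the first job is small, so compressing it is not worthwhile) gave this result: total running time 39 min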

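That saves 15 min. Note the difference in the following counters.
Without compression:
   Local bytes read=1923047905109
   Local bytes written=1685607947227
With compression:
   Local bytes read=770579526349
   Local bytes written=245469534966
The amount of local reading and writing dropped sharply.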
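The size of the drop can be worked out directly from the figures above. The sketch below only does that arithmetic; the class CounterRatio and the method percentSaved are names invented here. With these numbers it comes to roughly 60% fewer local bytes read, roughly 85% fewer local bytes written, and roughly 28% less running time.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic; the inputs are the figures quoted above.
class CounterRatio {
    // Percentage by which `after` is smaller than `before`.
    static double percentSaved(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        double read = percentSaved(1923047905109L, 770579526349L);
        double written = percentSaved(1685607947227L, 245469534966L);
        double minutes = percentSaved(54, 39);
        System.out.println(String.format(Locale.ROOT, "local bytes read saved:    %.1f%%", read));
        System.out.println(String.format(Locale.ROOT, "local bytes written saved: %.1f%%", written));
        System.out.println(String.format(Locale.ROOT, "running time saved:        %.1f%%", minutes));
    }
}
```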

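As for compressing the reduce output, our tests unfortunately showed essentially no speedup. A possible reason is that most of the first job's output is consumed by map tasks on the same machine and never crosses the network.
Appendix: to compress the map output, just add jobConf.setMapOutputCompressorClass(DefaultCodec.class)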
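To get a feel for why compressed map output means fewer bytes on local disk, here is a small JDK-only sketch. As far as I know DefaultCodec is Hadoop's zlib (DEFLATE) codec, so java.util.zip.Deflater is used here as a stand-in; the class MapOutputCompressionDemo and the sample records are made up, and real map output will compress by a different ratio.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo; it does not use Hadoop and only imitates key/value lines.
class MapOutputCompressionDemo {
    // Build made-up tab-separated records that repeat a lot, like log-derived keys.
    static byte[] sampleRecords(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("user").append(i % 100).append('\t')
              .append("page=/index.html\tstatus=200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input; stop instead of spinning
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleRecords(10000);
        byte[] compressed = deflate(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

The CPU time spent in deflate and inflate is the cost side of the trade-off described earlier; the saving is the smaller number of bytes written, read, and shuffled.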

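3 Setting the number of reducers
How many reducers is the right number? Our tests so far suggest that about half of the total number of cores across the cluster's datanodes works well. For example, with 32 datanodes of 8 cores each, setting the number of reducers to 128 was fastest: on each 8-core machine, 4 cores run map tasks and 4 run reduce tasks, which fits exactly.
Appendix, a small test with the same program:
         reduce num=32, reduce time = 6 min
         reduce num=128, reduce time = 2 min
         reduce num=320, reduce time = 5 min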
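The rule of thumb above is easy to write down as code. This is only a sketch of that heuristic, not a Hadoop API: the class ReducerCount and the method suggestReducers are names made up here. The result would then be passed to the job configuration, for example with JobConf.setNumReduceTasks(int).

```java
// Hypothetical helper that encodes the "half of all datanode cores" heuristic.
class ReducerCount {
    static int suggestReducers(int dataNodes, int coresPerNode) {
        if (dataNodes <= 0 || coresPerNode <= 0) {
            throw new IllegalArgumentException("node and core counts must be positive");
        }
        // Half of the cores for reduce tasks, the other half left for map tasks.
        return Math.max(1, dataNodes * coresPerNode / 2);
    }

    public static void main(String[] args) {
        // The cluster from the example: 32 datanodes with 8 cores each.
        System.out.println("suggested reducers: " + suggestReducers(32, 8));
    }
}
```

Treat the value as a starting point and measure, as the small test above does; the best number depends on the job and on the cluster.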
