Handling the Text type when overriding RecordWriter in Hadoop

When overriding the RecordWriter, is there any difference between out.write(to.getBytes(), 0, to.getLength()); and out.write(o.toString().getBytes(utf8));?

Why are the two cases handled separately?

private void writeObject(Object o) throws IOException {
   if (o instanceof Text) {
      Text to = (Text) o;
      out.write(to.getBytes(), 0, to.getLength());
   } else {
      out.write(o.toString().getBytes(utf8));
   }
}

easthome001 · posted 2016-12-16 17:33:54

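Why the two cases are separated depends on your business logic.
Looking at it purely from the program's point of view:
The output data falls into two types:
text and non-text.
Text values are written directly, since Text already stores its content as UTF-8 bytes.
Non-text values are converted to a string and written out UTF-8 encoded.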
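
One detail behind the Text branch may be worth spelling out: the Hadoop documentation for Text.getBytes() states that only the data up to getLength() is valid, which is why the write passes an explicit offset and length. The sketch below illustrates that contract without depending on Hadoop. Utf8Buffer is a hypothetical stand-in for Text, and TextWriteDemo, writeValid, writeWhole and reused are illustrative names, none of them part of any Hadoop API. It models the documented behaviour, not Text's actual implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for org.apache.hadoop.io.Text: a reusable UTF-8
// buffer whose backing array can be longer than the valid data.
class Utf8Buffer {
    private byte[] bytes = new byte[0];
    private int length = 0;

    void set(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > bytes.length) {
            bytes = new byte[encoded.length]; // grow only when needed
        }
        System.arraycopy(encoded, 0, bytes, 0, encoded.length);
        length = encoded.length; // bytes past this index are stale
    }

    byte[] getBytes() { return bytes; }

    int getLength() { return length; }
}

class TextWriteDemo {
    // Writes only the valid prefix, like out.write(to.getBytes(), 0, to.getLength()).
    static String writeValid(Utf8Buffer t) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(t.getBytes(), 0, t.getLength());
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Writes the whole backing array, including any stale bytes.
    static String writeWhole(Utf8Buffer t) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] all = t.getBytes();
        out.write(all, 0, all.length);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // A buffer that first held a long value and was then reused for a short one.
    static Utf8Buffer reused() {
        Utf8Buffer t = new Utf8Buffer();
        t.set("hadoop");
        t.set("mr");
        return t;
    }
}
```

With a reused buffer, writeValid returns only the current value, while writeWhole also emits leftover bytes from the earlier, longer value. Going through toString() and getBytes(utf8) would avoid the stale bytes too, but at the cost of decoding and re-encoding data that is already UTF-8.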

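shfshihuafeng · posted 2016-12-16 17:58:30

@easthome001: Hello

At work I use MapReduce to write data to HDFS with 3 reducers. My value is a string, passed as new Text(value) into a custom RecordWriter, where I call out.write(o.toString().getBytes()); directly, but the output on HDFS is empty. Just before out.write(o.toString().getBytes()) I log System.out.println(value.toString()), and the data is there. I still have not found why the output is empty. The code: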
public class InputEdgeInReplaceIDFormat<K, V> extends FileOutputFormat<K, V> {

   public static final Logger LOG = Logger.getLogger(InputEdgeInReplaceIDFormat.class);

   public static class ReplaceRecordWriter<K, V> extends RecordWriter<K, V> {

      private DataOutputStream fileWrite;

      public ReplaceRecordWriter(Configuration conf, DataOutputStream out)
            throws IOException {
         this.fileWrite = out;
      }

      @Override
      public void close(TaskAttemptContext context) throws IOException, InterruptedException {
         // TODO Auto-generated method stub
      }

      @Override
      public void write(K key, V value) throws IOException, InterruptedException {
         // TODO Auto-generated method stub
         System.out.println(value.toString() + "\n");
         // Start writing the data
         fileWrite.write((value.toString() + "\n").getBytes());
      }
   }

   @Override
   public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
      // TODO Auto-generated method stub

      // Create the output stream
      Configuration conf = context.getConfiguration();
      Path file = getDefaultWorkFile(context, "");
      FileSystem fs = file.getFileSystem(conf);
      FSDataOutputStream fileOut = fs.create(file, false);

      return new ReplaceRecordWriter<K, V>(conf, fileOut);
   }
}
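
One likely cause, assuming nothing else in the job closes the stream: the close method of the posted writer is empty, so fileWrite is never flushed or closed, and bytes still sitting in client-side buffers may never reach the file. The sketch below is not Hadoop code. It uses only java.io, with BufferedOutputStream standing in for the buffering an HDFS output stream performs, and CloseFlushDemo and bytesVisible are hypothetical names chosen for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class CloseFlushDemo {
    // Writes one record through a buffered stream and reports how many
    // bytes actually reached the underlying sink.
    static int bytesVisible(boolean closeStream) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream out = new BufferedOutputStream(sink);
        try {
            out.write("value\n".getBytes(StandardCharsets.UTF_8));
            if (closeStream) {
                out.close(); // close() flushes the buffer into the sink
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }
}
```

Without the close call, the record stays in the buffer and the sink remains empty, which matches the symptom of log lines showing data while the output file has none. In the writer above, the analogous change would be to call fileWrite.close() inside close(TaskAttemptContext).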

easthome001 · posted 2016-12-16 18:08:59

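You need to match Hadoop data types to the corresponding Java data types.
Hadoop's Text is essentially the counterpart of Java's String type.
Looking at the program below, your output should go through the Text branch, that is, out.write(to.getBytes(), 0, to.getLength());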
private void writeObject(Object o) throws IOException {
   if (o instanceof Text) {
      Text to = (Text) o;
      out.write(to.getBytes(), 0, to.getLength());
   } else {
      out.write(o.toString().getBytes(utf8));
   }
}

##############################
Below is the mapping between the data types.

(2) Basic types (Hadoop : Java):
Data type          Hadoop data type      Java data type

Boolean            BooleanWritable       boolean
Integer            IntWritable           int
Float              FloatWritable         float
Double             DoubleWritable        double
Byte               ByteWritable          byte
A note on how to convert between Hadoop data types and Java data types.
There are two ways:
1. Through the set method.
2. Through new, that is, the constructor.

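(3) Others (partial):
Text: corresponds to the Java data type String.
ArrayWritable: corresponds to Java arrays.
Source:
Hadoop programming basics: introduction to data types and conversion to and from Java data types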
http://www.aboutyun.com/forum.php?mod=viewthread&tid=7036