You can refer to the following for an implementation:
File compression has two major benefits: first, it reduces the disk space needed to store files; second, it speeds up data transfer over the network and to and from disk. Both benefits matter a great deal when working with big data. Below is an example that uses the gzip codec to compress a file: /user/hadoop/aa.txt is compressed into /user/hadoop/text.gz.
package com.hdfs;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecTest {
    // Compress a file
    public static void compress(String codecClassName) throws Exception {
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Path of the compressed output file
        FSDataOutputStream outputStream = fs.create(new Path("/user/hadoop/text.gz"));
        // Path of the file to be compressed
        FSDataInputStream in = fs.open(new Path("/user/hadoop/aa.txt"));
        // Wrap the raw output stream in a compressing output stream
        CompressionOutputStream out = codec.createOutputStream(outputStream);
        IOUtils.copyBytes(in, out, conf);
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }

    // Decompress a file with an explicitly chosen codec (gzip here)
    public static void uncompress(String fileName) throws Exception {
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        FSDataInputStream inputStream = fs.open(new Path(fileName));
        // Decompress the data and write it to the console
        InputStream in = codec.createInputStream(inputStream);
        IOUtils.copyBytes(in, System.out, conf);
        IOUtils.closeStream(in);
    }

    // Infer the codec from the file extension and use it to decompress the file
    public static void uncompress1(String uri) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.out.println("no codec found for " + uri);
            System.exit(1);
        }
        // The output path is the input path with the codec's extension removed
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }

    public static void main(String[] args) throws Exception {
        //compress("org.apache.hadoop.io.compress.GzipCodec");
        //uncompress("/user/hadoop/text.gz");
        uncompress1("hdfs://master:9000/user/hadoop/text.gz");
    }
}
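Note that compress above hardcodes both the input and output paths. If you want something more reusable, a codec's getDefaultExtension() method can derive the output name from the input path. Below is a minimal sketch of that pattern; the helper name compressTo and the append-the-extension naming scheme are my own additions, not part of the original post. It can be dropped into CodecTest as-is, since it uses only the imports already listed above.

// Hypothetical helper (not in the original post): compress an arbitrary HDFS
// file, deriving the output name from the codec's default extension,
// e.g. /user/hadoop/aa.txt -> /user/hadoop/aa.txt.gz for GzipCodec.
public static void compressTo(String inputUri, String codecClassName) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(inputUri), conf);
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
            Class.forName(codecClassName), conf);
    Path inputPath = new Path(inputUri);
    // Append the codec's extension (".gz", ".bz2", ...) to name the output
    Path outputPath = new Path(inputUri + codec.getDefaultExtension());
    InputStream in = null;
    CompressionOutputStream out = null;
    try {
        in = fs.open(inputPath);
        out = codec.createOutputStream(fs.create(outputPath));
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(out);
        IOUtils.closeStream(in);
    }
}

Calling compressTo("hdfs://master:9000/user/hadoop/aa.txt", "org.apache.hadoop.io.compress.GzipCodec") would then produce aa.txt.gz next to the original file, and the same method works for any codec class name you pass in.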
First run compress(...) in main to create the compressed file, then comment it out and run uncompress(...). It decompresses to standard output, so you will see the contents of /user/hadoop/aa.txt in the console. Running uncompress1(...) instead decompresses the file to /user/hadoop/text: it chooses the decompression codec from the .gz extension of /user/hadoop/text.gz, and the output path is simply the input path with that extension removed. After compressing the file, run ./hadoop fs -ls /user/hadoop/ to inspect the files:

[hadoop@master bin]$ ./hadoop fs -ls /user/hadoop/
Found 7 items
-rw-r--r-- 3 hadoop supergroup 76805248 2013-06-17 23:55 /user/hadoop/aa.mp4
-rw-r--r-- 3 hadoop supergroup 520 2013-06-17 22:29 /user/hadoop/aa.txt
drwxr-xr-x - hadoop supergroup 0 2013-06-16 17:19 /user/hadoop/input
drwxr-xr-x - hadoop supergroup 0 2013-06-16 19:32 /user/hadoop/output
drwxr-xr-x - hadoop supergroup 0 2013-06-18 17:08 /user/hadoop/test
drwxr-xr-x - hadoop supergroup 0 2013-06-18 19:45 /user/hadoop/test1
-rw-r--r-- 3 hadoop supergroup 46 2013-06-19 20:09 /user/hadoop/text.gz
The original file /user/hadoop/aa.txt is 520 bytes, while the compressed file /user/hadoop/text.gz is only 46 bytes. This illustrates the two benefits of compression described above.
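If you want to verify the sizes programmatically rather than with hadoop fs -ls, FileSystem.getFileStatus exposes the file length. The following is a minimal sketch; the class name SizeCheck is hypothetical, and the paths are taken from the listing above.

package com.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: print the original and compressed file sizes from HDFS.
public class SizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long original = fs.getFileStatus(new Path("/user/hadoop/aa.txt")).getLen();
        long compressed = fs.getFileStatus(new Path("/user/hadoop/text.gz")).getLen();
        System.out.println("aa.txt: " + original + " bytes, text.gz: " + compressed + " bytes");
    }
}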