
Configuring Hadoop and Hive to Use LZO Compression

Last edited by sunshine_junge on 2014-12-28 20:24


Guiding questions:
1. How does Hadoop use LZO?
2. How does Hive use LZO?





Install the LZO compression tools

LZO

    wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.08.tar.gz
    tar -zxvf lzo-2.08.tar.gz
    cd lzo-2.08
    # build 64-bit shared libraries
    export CFLAGS=-m64
    ./configure --enable-shared --prefix=/usr/local/cloud/hadoop/lzo/
    make && make install
    # make the libraries and headers visible system-wide
    cp /usr/local/cloud/hadoop/lzo/lib/* /usr/lib/
    cp /usr/local/cloud/hadoop/lzo/lib/* /usr/lib64/
    cp -r /usr/local/cloud/hadoop/lzo/include/* /usr/include/


LZOP

    wget http://www.lzop.org/download/lzop-1.03.tar.gz
    tar -zxvf lzop-1.03.tar.gz
    cd lzop-1.03
    ./configure --enable-shared --prefix=/usr/local/cloud/hadoop/lzop/
    make && make install
    # link the lzop binary into /usr/bin so it is on PATH
    cd /usr/bin
    ln -s -f /usr/local/cloud/hadoop/lzop/bin/lzop


Test

    history > history.log
    lzop history.log

If a history.log.lzo file is generated, the LZO tools are installed correctly.
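A slightly stronger check than looking for the .lzo file is a round trip: compress, decompress, and compare with the original. The following is a minimal sketch, not part of the original post; the function name `lzop_roundtrip` is made up for illustration, and it assumes `lzop` is on PATH.

```shell
#!/bin/sh
# Round-trip check (sketch): compress with lzop, decompress, compare to the original.
# Returns 0 if the data survives, 1 on mismatch or error, 2 if lzop is not on PATH.
lzop_roundtrip() {
    src=$1
    if ! command -v lzop >/dev/null 2>&1; then
        echo "lzop not found on PATH" >&2
        return 2
    fi
    tmpdir=$(mktemp -d) || return 1
    # -c writes to stdout, so the source file is left untouched
    if lzop -c "$src" > "$tmpdir/out.lzo" &&
       lzop -dc "$tmpdir/out.lzo" | cmp -s - "$src"; then
        rc=0
    else
        rc=1
    fi
    rm -rf "$tmpdir"
    return "$rc"
}
```

For example, `lzop_roundtrip history.log && echo "round trip OK"` run against the file created in the test above.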



Install Hadoop-LZO

    git clone https://github.com/twitter/hadoop-lzo.git
    cd hadoop-lzo
    export CFLAGS=-m64
    export CXXFLAGS=-m64
    # point the native build at the LZO headers and libraries installed above
    export C_INCLUDE_PATH=/usr/local/cloud/hadoop/lzo/include
    export LIBRARY_PATH=/usr/local/cloud/hadoop/lzo/lib
    mvn clean package -Dmaven.test.skip=true
    cp -r target/native/Linux-amd64-64 /usr/local/cloud/hadoop/lib/native/
    cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /usr/local/cloud/hadoop/share/hadoop/common/
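After the two copy steps it can be worth confirming that both the jar and the native library landed under the Hadoop directory. This is a minimal sketch under stated assumptions: the function name `check_hadoop_lzo` is hypothetical, and it assumes the native build produces a library named `libgplcompression.so*` somewhere under `lib/native`.

```shell
#!/bin/sh
# Check (sketch) that the hadoop-lzo jar and native library were copied.
# $1 is the Hadoop install directory, e.g. /usr/local/cloud/hadoop
check_hadoop_lzo() {
    home=$1
    missing=0
    # jar copied into the common classpath directory
    if ! ls "$home"/share/hadoop/common/hadoop-lzo-*.jar >/dev/null 2>&1; then
        echo "missing: hadoop-lzo jar under $home/share/hadoop/common" >&2
        missing=1
    fi
    # JNI library from the native build, searched anywhere under lib/native
    if [ -z "$(find "$home/lib/native" -name 'libgplcompression.so*' 2>/dev/null | head -n 1)" ]; then
        echo "missing: libgplcompression under $home/lib/native" >&2
        missing=1
    fi
    return "$missing"
}
```

For example, `check_hadoop_lzo /usr/local/cloud/hadoop && echo "hadoop-lzo files in place"`.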

Before building, note that you can edit pom.xml to match your own Hadoop version: find the hadoop.current.version property and change its value. You also need to add the Cloudera repository to pom.xml:

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>
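To see which Hadoop version the build currently targets before editing, the property can be read straight from pom.xml. The helper below is a rough sketch, not something from the original post: `pom_property` is a hypothetical name, and it assumes the property is written on a single line as `<name>value</name>`.

```shell
#!/bin/sh
# Print the value of a single-line <name>value</name> element in a pom.xml (first match).
# $1 = path to pom.xml, $2 = property name, e.g. hadoop.current.version
pom_property() {
    sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1" | head -n 1
}
```

For example, `pom_property pom.xml hadoop.current.version` run inside the hadoop-lzo checkout.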

Modify the configuration

Environment variables (typically set in hadoop-env.sh):

    # Add the following settings
    export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
    export LD_LIBRARY_PATH=$HADOOP_HOME/lzo/lib

Codec registration (typically in core-site.xml):

    <property>
        <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>
    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>

Map output compression (typically in mapred-site.xml):

    <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.map.output.compression.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    <property>
        <name>mapred.child.env</name>
        <value>LD_LIBRARY_PATH=/usr/local/cloud/hadoop/lzo/lib</value>
    </property>
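Because the codec list is one long comma-separated value, a typo in it is easy to miss. The sketch below prints the registered codecs one per line so the two LZO classes can be checked by eye or with grep. It is an illustration only: `list_codecs` is a hypothetical helper, and it assumes the `<name>` and `<value>` elements each sit on their own line, as in the listing above.

```shell
#!/bin/sh
# Print each class listed in io.compression.codecs, one per line.
# $1 = path to the Hadoop XML config file that holds the property
list_codecs() {
    awk '
        /<name>io\.compression\.codecs<\/name>/ { want = 1; next }
        want && /<value>/ {
            # strip everything up to <value> and from </value> onward
            gsub(/.*<value>|<\/value>.*/, "")
            n = split($0, codecs, ",")
            for (i = 1; i <= n; i++) print codecs[i]
            exit
        }
    ' "$1"
}
```

For example, `list_codecs "$HADOOP_HOME/etc/hadoop/core-site.xml" | grep -i lzo`, where the config path is an assumption about the local layout.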


Hive

To use LZO files in Hive, specify the input and output formats when creating the table:

    create table log_lzo (
      line string comment 'text line')
    partitioned by (logdate string comment 'log file time,format-yyyyMMdd')
    STORED AS
      INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
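The table is partitioned by logdate, so each .lzo file has to be loaded into a specific partition. One way to do that is a `LOAD DATA` statement passed to `hive -e`. The sketch below only builds the statement; `load_lzo_stmt` is a hypothetical helper, and the file path in the usage line is made up.

```shell
#!/bin/sh
# Build a HiveQL statement that loads a local .lzo file into one partition of log_lzo.
# $1 = local path to the .lzo file, $2 = partition value in yyyyMMdd form
load_lzo_stmt() {
    printf "LOAD DATA LOCAL INPATH '%s' INTO TABLE log_lzo PARTITION (logdate='%s');\n" "$1" "$2"
}
```

For example, `hive -e "$(load_lzo_stmt /tmp/access.log.lzo 20140625)"`.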

From this table you can then parse each line and store the result in RCFile format:

    create table log_rcfile (
      `ip` string COMMENT 'ip',
      `timestamp` string COMMENT 'timestamp',
      `url` string COMMENT 'url')
    PARTITIONED BY (
      logdate string comment 'log file time,format-yyyyMMdd')
    STORED AS RCFILE;

    -- split each raw line on the literal delimiter "#|~" and keep the first three fields
    insert overwrite table log_rcfile partition(logdate='20140625')
    select
      fields[0] as `ip`,
      fields[1] as `timestamp`,
      fields[2] as `url`
    from (
      select
        split(line, '#\\|~') as fields
      from log_lzo
      where
        1 = 1
        and logdate = '20140625'
    )t;
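The split pattern in the query matches the literal three-character delimiter `#|~`; the pipe has to be escaped because split() takes a regular expression. The same split can be tried on a sample line outside Hive. This is a small sketch with a hypothetical helper name, and the sample line in the usage note is invented for illustration.

```shell
#!/bin/sh
# Split one log line on the literal delimiter "#|~" and print each field on its own line.
# [|] is a bracket expression, so the pipe is matched literally rather than as alternation.
split_log_line() {
    printf '%s\n' "$1" | awk -F'#[|]~' '{ for (i = 1; i <= NF; i++) print $i }'
}
```

For example, `split_log_line '10.0.0.1#|~2014-06-25 12:00:00#|~/index.html'` prints the three fields that the query maps to ip, timestamp, and url.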





