This post was last edited by sunshine_junge on 2014-12-28 20:24
Reading guide:
1. How does Hadoop use LZO?
2. How does Hive use LZO?
Installing the LZO compression tools
LZO
- wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.08.tar.gz
- tar -zxvf lzo-2.08.tar.gz
- cd lzo-2.08
- export CFLAGS=-m64
- ./configure --enable-shared --prefix=/usr/local/cloud/hadoop/lzo/
- make && make install
- cp /usr/local/cloud/hadoop/lzo/lib/* /usr/lib/
- cp /usr/local/cloud/hadoop/lzo/lib/* /usr/lib64/
- cp -r /usr/local/cloud/hadoop/lzo/include/* /usr/include/
LZOP
- wget http://www.lzop.org/download/lzop-1.03.tar.gz
- tar -zxvf lzop-1.03.tar.gz
- cd lzop-1.03
- ./configure --enable-shared --prefix=/usr/local/cloud/hadoop/lzop/
- make && make install
- cd /usr/bin
- ln -s -f /usr/local/cloud/hadoop/lzop/bin/lzop
Test
- history > history.log
- lzop history.log
If a history.log.lzo file is generated, LZO has been installed successfully.
Installing Hadoop-LZO
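Checking only that the .lzo file exists does not show that it can be read back. A minimal sketch of a round-trip check follows; the function name lzop_roundtrip_check and the sample data are made up for illustration, and it assumes lzop and cmp are on the PATH.

```shell
#!/usr/bin/env bash
# Round-trip check: compress a small file, test the archive, decompress
# it to stdout and compare with the original.
# Returns 0 on success, 1 on failure, 2 if lzop is not installed.
lzop_roundtrip_check() {
    command -v lzop >/dev/null 2>&1 || { echo "lzop not found on PATH" >&2; return 2; }
    local workdir status=0
    workdir=$(mktemp -d) || return 1
    # Hypothetical sample data; any file would do.
    printf 'line %s\n' 1 2 3 > "$workdir/sample.log"
    # lzop keeps the input file by default and writes sample.log.lzo next to it.
    if lzop "$workdir/sample.log" \
        && lzop -t "$workdir/sample.log.lzo" \
        && lzop -dc "$workdir/sample.log.lzo" | cmp -s - "$workdir/sample.log"; then
        echo "lzop round trip OK"
    else
        echo "lzop round trip FAILED" >&2
        status=1
    fi
    rm -rf "$workdir"
    return "$status"
}
```

Calling lzop_roundtrip_check after the install should print "lzop round trip OK" if compression and decompression both work.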
- git clone https://github.com/twitter/hadoop-lzo.git
- export CFLAGS=-m64
- export CXXFLAGS=-m64
- export C_INCLUDE_PATH=/usr/local/cloud/hadoop/lzo/include
- export LIBRARY_PATH=/usr/local/cloud/hadoop/lzo/lib
- mvn clean package -Dmaven.test.skip=true
- cp -r target/native/Linux-amd64-64 /usr/local/cloud/hadoop/lib/native/
- cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /usr/local/cloud/hadoop/share/hadoop/common/
Note that you can edit pom.xml to match your own Hadoop version: find the hadoop.current.version property and change it. You also need to add the Cloudera repository address:
- <repositories>
- <repository>
- <id>cloudera</id>
- <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
- </repository>
- </repositories>
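After the build and the two cp commands, it can be worth confirming that both artifacts ended up under the Hadoop directory. The sketch below is one way to do that; the function name check_hadoop_lzo_artifacts is hypothetical, and it assumes the native library built by hadoop-lzo is named libgplcompression, searching for it recursively because the post does not show the exact layout inside Linux-amd64-64.

```shell
#!/usr/bin/env bash
# Check that the hadoop-lzo jar and native library are present.
# Usage: check_hadoop_lzo_artifacts /usr/local/cloud/hadoop
# Returns 0 if both are found, 1 if something is missing, 2 on bad usage.
check_hadoop_lzo_artifacts() {
    local hadoop_home=$1 missing=0
    if [ -z "$hadoop_home" ]; then
        echo "usage: check_hadoop_lzo_artifacts HADOOP_HOME" >&2
        return 2
    fi
    # The jar copied into share/hadoop/common.
    if ! find "$hadoop_home/share/hadoop/common" -name 'hadoop-lzo-*.jar' 2>/dev/null | grep . >/dev/null; then
        echo "missing: hadoop-lzo jar under share/hadoop/common"
        missing=1
    fi
    # The native library (assumed name: libgplcompression) anywhere under lib/native.
    if ! find "$hadoop_home/lib/native" -name 'libgplcompression*' 2>/dev/null | grep . >/dev/null; then
        echo "missing: libgplcompression under lib/native"
        missing=1
    fi
    [ "$missing" -eq 0 ] && echo "hadoop-lzo artifacts found"
    return "$missing"
}
```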
Modify the configuration
- # Add the following settings
- export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
- export LD_LIBRARY_PATH=$HADOOP_HOME/lzo/lib
- <property>
- <name>io.compression.codecs</name>
- <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
- </property>
- <property>
- <name>io.compression.codec.lzo.class</name>
- <value>com.hadoop.compression.lzo.LzoCodec</value>
- </property>
- <property>
- <name>mapred.compress.map.output</name>
- <value>true</value>
- </property>
- <property>
- <name>mapred.map.output.compression.codec</name>
- <value>com.hadoop.compression.lzo.LzoCodec</value>
- </property>
- <property>
- <name>mapred.child.env</name>
- <value>LD_LIBRARY_PATH=/usr/local/cloud/hadoop/lzo/lib</value>
- </property>
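A typo in the codec list is easy to miss, so a quick text check of the configuration file can help. This is a rough sketch only: the function name codec_registered is hypothetical, it assumes the io.compression.codecs property sits in a file such as core-site.xml (the post does not name the file), and it relies on the value appearing within two lines of the name, as in the snippet above.

```shell
#!/usr/bin/env bash
# Naive check that a codec class is listed in io.compression.codecs.
# Usage: codec_registered CONF_FILE CODEC_CLASS
# Returns 0 if found, 1 if not found, 2 if the file cannot be read.
codec_registered() {
    local conf_file=$1 codec=$2
    [ -r "$conf_file" ] || { echo "cannot read $conf_file" >&2; return 2; }
    # Take the <name> line plus the two lines after it, then look for the
    # codec class as a fixed string. This is a text match, not an XML parse.
    grep -A2 -F '<name>io.compression.codecs</name>' "$conf_file" \
        | grep -F -- "$codec" >/dev/null
}
```

For example, codec_registered "$HADOOP_HOME/etc/hadoop/core-site.xml" com.hadoop.compression.lzo.LzoCodec, where the path is an assumption about a typical Hadoop 2 layout.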
Hive
To use LZO files in Hive, the format must be specified when the table is created:
- create table log_lzo (
- line string comment 'text line')
- partitioned by (logdate string comment 'log file time,format-yyyyMMdd')
- STORED AS
- INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
- OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
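The post does not show how data gets into log_lzo. One possible way, sketched here under assumptions, is a LOAD DATA statement per partition; the helper name build_load_stmt and the example file path are hypothetical, and the function only prints the statement so it can be reviewed before being passed to hive -e.

```shell
#!/usr/bin/env bash
# Build a HiveQL LOAD DATA statement for one partition of log_lzo.
# Usage: build_load_stmt LOCAL_LZO_FILE LOGDATE   (LOGDATE as yyyyMMdd)
build_load_stmt() {
    local file=$1 logdate=$2
    # Accept exactly eight digits, matching the yyyyMMdd partition format.
    case $logdate in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "logdate must be yyyyMMdd" >&2; return 1 ;;
    esac
    printf "LOAD DATA LOCAL INPATH '%s' INTO TABLE log_lzo PARTITION (logdate='%s');\n" \
        "$file" "$logdate"
}
```

It could then be used as hive -e "$(build_load_stmt /tmp/history.log.lzo 20140625)", with the path replaced by a real LZO-compressed log file.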
After that, the files can be parsed on this basis and stored in RCFile format:
- create table log_rcfile (
- `ip` string COMMENT 'ip',
- `timestamp` string COMMENT 'timestamp',
- `url` string COMMENT 'url')
- PARTITIONED BY (
- logdate string comment 'log file time,format-yyyyMMdd')
- STORED AS RCFILE;
-
- insert overwrite table log_rcfile partition(logdate='20140625')
- select
- fields[0] as `ip`,
- fields[1] as `timestamp`,
- fields[2] as `url`
- from (
- select
- split(line, '#\\|~') as fields
- from log_lzo
- where
- 1 = 1
- and logdate = '20140625'
- )t;
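The pattern '#\\|~' passed to split is a regular expression: after Hive unescapes the string literal it becomes #\|~, which matches the literal three-character delimiter #|~. To preview how a line would be split before running the insert, something like the following awk sketch could be used; the function name split_log_line and the sample line are made up for illustration.

```shell
#!/usr/bin/env bash
# Split log lines on the literal delimiter "#|~" and label the first
# three fields the same way the Hive query does.
split_log_line() {
    # A multi-character -F value is treated as a regex by awk; [|] keeps
    # the pipe literal, mirroring the escaped \| in the Hive pattern.
    awk -F'#[|]~' '{ printf "ip=%s timestamp=%s url=%s\n", $1, $2, $3 }'
}
```

For example, piping the hypothetical line 10.0.0.1#|~2014-06-25 10:00:00#|~/index.html into split_log_line should label the three parts as ip, timestamp and url.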