问题导读
1.将hive表转换为orc格式有什么优点?
2.如何将格式表转换?
3.转换后,效率有什么变化?
将hive表转换为orc格式存储,这种格式一方面能够大幅降低磁盘占用,另一方面能够大幅提高执行效率。
将hive表转换为orc格式表之前,需要分两步执行。
首先需要根据hive表创建orc格式的hive表,列与列直接对应上:
[mw_shl_code=bash,true]create table sp_orc.heartbeat(
id String,
name String
)
partitioned by (date string)
stored as orc;[/mw_shl_code]
然后就可以将textfile格式的源hive表分区导入orc表中:
[mw_shl_code=bash,true]insert into table sp_orc.heartbeat partition (date='2014-09-01')
select id, name from sp_logs.heartbeat where date='2014-09-01';[/mw_shl_code]
需要注意的是,如果源表与orc表结构完全相同,那么以上sql中的子查询就可以直接采用select *了。
简单测试一下lzo压缩带来的性能变化。
原始数据原始数据有23690132条,查询如下: [mw_shl_code=bash,true]hive> select count(\*) from heartbeat where date='2014-09-01';
Total MapReduce CPU Time Spent: 1 minutes 28 seconds 730 msec
OK
23690132
Time taken: 22.8 seconds, Fetched: 1 row(s)
hive> select id, max(power) from sp_logs.heartbeat where date='2014-09-01' group by id;
......
Time taken: 24.152 seconds, Fetched: 14213 row(s)[/mw_shl_code]
转换为ORC存储
可以直接在hive中执行insert into,将textfile格式的源数据表导入orc格式的新表,要注意insert into时的新旧表列对应。
[mw_shl_code=bash,true]hive> insert into table sp_orc.heartbeat partition (date='2014-09-01')
select id,power,code,ver,oper,pwd,suc,card,p,ip,time from sp_logs.heartbeat where date='2014-09-01';
Total MapReduce CPU Time Spent: 6 minutes 57 seconds 50 msec
OK
Time taken: 205.335 seconds[/mw_shl_code]
ORC格式
ORC格式的表数据一样是23690132条,查询耗时如下:
[mw_shl_code=bash,true]hive> select count(*) from sp_orc.heartbeat where date='2014-09-01';
Total MapReduce CPU Time Spent: 8 seconds 80 msec
OK
23690132
Time taken: 22.058 seconds, Fetched: 1 row(s)
hive> select id, max(power) from sp_orc.heartbeat where date='2014-09-01' group by id;
Time taken: 41.483 seconds, Fetched: 14213 row(s)[/mw_shl_code]
稍微复杂的性能测试
使用一点简单的group by查询,性能对比如下:
[mw_shl_code=bash,true]hive> select id from sp_logs.heartbeat group by id having max(time) < '2014-09-31 12:00:00';
Time taken: 599.524 seconds, Fetched: 679982 row(s)
hive> select id from sp_orc.heartbeat group by id having max(time) < '2014-09-31 12:00:00';
Time taken: 190.929 seconds, Fetched: 679982 row(s)
[/mw_shl_code]
使用线上的数据再进行一下简单的测试:
[mw_shl_code=bash,true]select count(*) from sp_prod_orc.heartbeat where date='2014-10-23';
28669162
Time taken: 24.08 seconds, Fetched: 1 row(s)
select count(*) from sp_prod_source.heartbeat where date='2014-10-23';
28684464
Time taken: 143.85 seconds, Fetched: 1 row(s)[/mw_shl_code]
这样看起来,orc存储的性能提升还是比较明显的。
|
|