How to optimize a Hive query that uses both DISTINCT and JOIN ... ON

A brief description of the situation:

Tables:
table1
table2
Intermediate table: new_table

HQL:
insert overwrite table new_table select distinct a.id, b.name, b.age from table1 b join table2 a on a.id = b.id where data = '2015-10-01'

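The statement combines DISTINCT with JOIN ... ON: it queries across the two tables and writes the result into the intermediate table.
When the data volume is very large, is there a better way to handle this?

Any suggestions would be appreciated.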

tntzbzc · posted 2015-9-13 10:34:36

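Not much can be told from the statement alone.
What matters most is the data volume of the two tables.

For example:
The basic rule for joins: put the table or subquery with fewer rows on the left side of the JOIN operator. The reason is that in the Reduce phase of a join, the rows of the table on the left of the JOIN operator are loaded into memory, while the last (rightmost) table is streamed through. Putting the smaller table on the left therefore reduces the chance of an out-of-memory error.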
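
The ordering rule above can be sketched in HiveQL as follows. This is only an illustration, and it assumes that table2 is the smaller table and table1 the larger one, which the thread never states; the aliases s and l are made up for the example. The second form uses Hive's STREAMTABLE hint to name the table to stream instead of reordering the join.

```sql
-- Assumption: table2 is the smaller table, table1 the larger one.
-- Smaller table on the left: its rows are buffered in the reducer,
-- while the last table in the join (table1) is streamed through.
SELECT s.id, l.name, l.age
FROM table2 s
JOIN table1 l ON s.id = l.id;

-- Alternative: keep the written order and tell Hive which table to stream.
SELECT /*+ STREAMTABLE(l) */ s.id, l.name, l.age
FROM table1 l
JOIN table2 s ON s.id = l.id;
```

Because this is an inner join, swapping the two sides does not change the result, only how the reducer buffers the rows.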

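More material:

Summary of Hive query optimization
http://www.aboutyun.com/thread-12363-1-1.html

Hive optimization and execution principles
http://www.aboutyun.com/thread-13419-1-1.html

Performance analysis of joining a small table with a large table in Hive
http://www.aboutyun.com/thread-7816-1-1.html

Hive data skew (large table joined with large table) [optimization]
http://www.aboutyun.com/thread-13077-1-1.html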

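levycui · posted 2015-9-14 16:32:54

tntzbzc posted 2015-9-13 10:34
Not much can be told from the statement alone.
What matters most is the data volume of the two tables.

So you mean that in "table1 b join table2 a on ...", the query runs faster when table1 holds less data than table2?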

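zcfightings · posted 2015-9-15 15:13:44

1. Use GROUP BY instead of DISTINCT, because DISTINCT (particularly in the COUNT(DISTINCT ...) form) can end up on a single reducer, whereas GROUP BY is spread across multiple reducers.
2. When joining a small table with a large table, put the small table first and the large table last. If the small table is very small, a map join can be used.
The statement could be rewritten as:
select /*+ mapjoin(a) */ a.id, b.name, b.age from (select id from table1 group by id) a join table2 b on (a.id = b.id and b.date = '2015-10-01')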
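
A fuller version of the two suggestions above might look like the sketch below. It rests on several assumptions that the thread does not confirm: the filter column is named date as in this reply (the original question spells it data) and lives in table2, name and age come from table2 as in this reply, and new_table has exactly the columns (id, name, age). The outer GROUP BY lists all three selected columns so that the result stays de-duplicated the same way the original SELECT DISTINCT was. In newer Hive versions the MAPJOIN hint may be ignored by default, so the sketch relies on automatic map-join conversion instead.

```sql
-- Let Hive convert the join to a map-side join when one side is small enough.
SET hive.auto.convert.join=true;

INSERT OVERWRITE TABLE new_table
SELECT a.id, b.name, b.age
FROM (
  -- De-duplicate the join keys first (assumed to be the smaller side).
  SELECT id
  FROM table1
  GROUP BY id
) a
JOIN (
  -- Filter before the join so fewer rows are shuffled.
  -- Assumption: the date column belongs to table2.
  SELECT id, name, age
  FROM table2
  WHERE `date` = '2015-10-01'
) b ON a.id = b.id
-- GROUP BY over every selected column replaces SELECT DISTINCT.
GROUP BY a.id, b.name, b.age;
```

Filtering inside the subquery rather than in the ON clause gives the same result for an inner join, and it makes it explicit that only one day's rows take part in the join.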

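levycui · posted 2015-9-15 21:37:25

zcfightings posted 2015-9-15 15:13
1. Use GROUP BY instead of DISTINCT, because DISTINCT can end up on a single reducer, whereas GROUP BY is spread across multiple reducers.
2. When joining a small table with a large table, the small ...

Thanks a lot, I will give it a try.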
