本帖最后由 songy 于 2015-12-15 13:06 编辑
使用hive distinct或group by 去重复,遇到奇怪问题:
1:我用hive sql, 使用DISTINCT ,或GROUP BY去重复,得到的查询结果为560条。
2:然后我不去重,把所有结果全部查出来,再用shell 对结果去重,结果为557条。
3:接着对两次的结果做了处理,已经找出差异的3条数据的id,返回去查询这3个id,同样使用distinct能查到,不使用却查不到结果。
使用group by 语句:
hive -e 'SELECT user_id FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") group by user_id order by user_id;'
结果:
91266439
91933190
92005120
92762464
95589779
96798858
97089159
97558244
Time taken: 97.044 seconds, Fetched: 560 row(s)
使用distinct 语句:
hive -e 'SELECT DISTINCT user_id FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") order by user_id;'
结果与group by 一样是 560 row(s)
先全部查询,再手动去重:
hive -e 'SELECT user_id FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") ;' | awk '{ array[$0]=$0 } END{ for( id in array ){ print id } }' | sort | nl
结果为557条:
550 91266439
551 91933190
552 92005120
553 92762464
554 95589779
555 96798858
556 97089159
557 97558244
然后找出差异的3条数据,单独查询:
不使用distinct ,查不到:
hive -e 'SELECT user_id FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") and user_id in ( "14616294","14623531","25003520") ;'
使用distinct ,能查到差异的3条数据:
hive -e 'SELECT distinct user_id FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") and user_id in ( "14616294","14623531","25003520") ;'
OK
14623531
14616294
25003520
怎么会这样奇怪?? 在执行语句过程中我发现使用distinct 或 group by 用到了mapreduce的过程,我也不知道是不是mr的问题,不知道接下来该怎么找问题??
进一步测试,有了意外发现,我把login_time字段也同时查询出来,发现多的这几个id的login_time 时间并不在我限定的时间范围内:
使用 distinct:
hive -e 'SELECT distinct user_id, login_time FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") and user_id in ( "14616294","14623531","25003520") ;'
使用 group by:
hive -e 'SELECT user_id, login_time FROM log.user_login WHERE login_time BETWEEN UNIX_TIMESTAMP("2015-12-10 00:00:00") AND UNIX_TIMESTAMP("2015-12-10 23:59:59") and user_id in ( "14616294","14623531","25003520") group by user_id, login_time ;'
均得到相同结果:
14616294 1449772753
25003520 1449807672
14623531 1449772517
14616294 1449772662
Time taken: 68.644 seconds, Fetched: 4 row(s)
然后转换时间戳如下(全为2015-12-11):
1449772753
2015-12-11 02:39:13
1449807672
2015-12-11 12:21:12
1449772517
2015-12-11 02:35:17
1449772662
2015-12-11 02:37:42
很奇怪 为何distinct /group by 会把2015-12-11日的这几个数据查出来, 我明明是要查2015-12-10 的。 为什么会出这个情况啊??
|
-
mr
|