Yes, the approach you describe works: Hive itself supports external MapReduce scripts. Using Hadoop Streaming together with the TRANSFORM, MAP, and REDUCE clauses, you can call an external script from within Hive.
calcwin.py
#!/usr/bin/env python
# calcwin.py -- Hive TRANSFORM script: decide whether a player won a hand.
# Input (tab-separated, one row per line): ldate, userid, roundbet, fold, allin, chipwon
# Output: "ldate:userid", win, fold, allin

import sys

def calcwin():
    for line in sys.stdin:
        (ldate, userid, roundbet, fold, allin, chipwon) = line.strip().split()
        win = '0'
        key = "%s:%s" % (ldate, userid)
        # A player who folded cannot have won the hand.
        if fold == '1':
            print('\t'.join([key, win, fold, allin]))
            continue
        # NULL chipwon means no chips were collected.
        if chipwon == "NULL":
            print('\t'.join([key, win, fold, allin]))
            continue
        # chipwon looks like "0:73250|1:60500|2:100135" -- sum the amounts
        # after each colon and compare against the total bet for the round.
        chipwonv = 0
        for v in chipwon.split('|'):
            chipwonv += int(v.split(':')[1])
        if chipwonv > int(roundbet):
            win = '1'
        print('\t'.join([key, win, fold, allin]))

calcwin()
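As a quick local sanity check outside Hive, the script's per-line logic can be exercised directly against two of the sample rows shown below; the expected tuples match the query results at the end of the post. This is a self-contained re-implementation of the same logic, not the script itself:

```python
# Re-implementation of calcwin.py's per-row logic for local testing.
def calc_row(ldate, userid, roundbet, fold, allin, chipwon):
    key = "%s:%s" % (ldate, userid)
    # Folded players and players with no chips won can never be winners.
    if fold == '1' or chipwon == "NULL":
        return (key, '0', fold, allin)
    # Sum the amounts after each colon in e.g. "0:73250|1:60500|2:100135".
    total = sum(int(v.split(':')[1]) for v in chipwon.split('|'))
    win = '1' if total > int(roundbet) else '0'
    return (key, win, fold, allin)

# Folded player from the sample data: never a winner.
print(calc_row("03/13/13", "185690475", "240", "1", "0", "NULL"))
# All-in player who collected 233885 chips against a 154135 bet: a winner.
print(calc_row("03/13/13", "186012530", "154135", "0", "1",
               "0:73250|1:60500|2:100135"))
```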
Raw data format:
hive> !hadoop fs -cat /flume/test/testpoker;
03/13/13 14:59:51 00000ab4 1009 185690475 8639 240 1 0 -1 NULL NULL
03/13/13 14:59:51 00000cb4 1009 187270278 92030 600 1 0 -1 NULL NULL
03/13/13 14:59:52 000003d8 1009 184151687 8639 600 1 0 -1 NULL NULL
03/13/13 14:59:52 00000ba8 1009 186012530 8593 154135 0 1 7 8|21|16|42|39 0:73250|1:60500|2:100135
03/13/13 14:59:52 00000a88 1009 180286243 92041 100 1 0 -1 NULL NULL
03/13/13 14:59:52 00000ad8 1009 163003653 2829 40 1 0 -1 NULL NULL
03/13/13 14:59:54 000002ac 1009 183824880 8639 1200 0 0 -1 NULL 0:1900
03/13/13 14:59:55 0000091c 1009 173274868 92030 600 0 0 -1 NULL 0:1150
Then register calcwin.py at the Hive command line and run the query (the original post's query said 'calcpoker.py', but the file added is calcwin.py, so the names must match):
hive> add file calcwin.py;
hive> from testpoker select transform(ldate,userid,roundbet,fold,allin,chipwon) using 'calcwin.py' as (key,win,fold,allin);
...
OK
03/13/13:185690475 0 1 0
03/13/13:187270278 0 1 0
03/13/13:184151687 0 1 0
03/13/13:186012530 1 0 1
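To see why only userid 186012530 is flagged as a winner above: its chipwon field sums to more than its roundbet, while every other player either folded or won nothing. The comparison can be reproduced in two lines:

```python
# chipwon "0:73250|1:60500|2:100135" vs. roundbet 154135 for userid 186012530
chipwon = "0:73250|1:60500|2:100135"
total = sum(int(part.split(':')[1]) for part in chipwon.split('|'))
print(total, total > 154135)  # 233885 True
```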
For writing Hive MapReduce in Java, the link has the corresponding source code and test examples. From what I found searching online, though, most such MapReduce scripts are written in Python.