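Reading guide:
1. Can Pig be combined with stream and MapReduce?
2. How can multiple statements be executed in Pig?
3. What happens when a Pig join input does not fit in memory?
4. What does Pig's UNION do?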
1. foreach
1.1 flatten()
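Purpose:
Removes one level of nesting from a bag or tuple.
Nested data is sometimes awkward to process; flattening a bag produces one flat output tuple per element of the bag.
eg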
pos = foreach players generate name, flatten(position) as position;
For example, given the row:
Jorge Posada,{(Catcher),(Designated_hitter)},
the output is:
(Jorge Posada,Catcher)
(Jorge Posada,Designated_hitter)
eg
C = FOREACH B GENERATE group, AVG(A.col5), AVG(A.col6);
Result:
((1,2,3),2.8,5.0)
((1,2,5),7.7,5.9)
((3,0,5),3.5,2.1)
((7,9,9),2.6,6.2)
C = FOREACH B GENERATE FLATTEN(group), AVG(A.col5), AVG(A.col6);
With FLATTEN(group), the result becomes:
(1,2,3,2.8,5.0)
(1,2,5,7.7,5.9)
(3,0,5,3.5,2.1)
(7,9,9,2.6,6.2)
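--!!-- If the bag being flattened is empty, flatten produces no record for that row. Replace null or empty bags with a placeholder first to keep such rows.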
eg
noempty = foreach players generate name,
((position is null or IsEmpty(position)) ? {('unknown')} : position)
as position;
pos = foreach noempty generate name, flatten(position) as position;
1.2 Nested foreach
A foreach block can run several statements on each record before the final generate.
eg
uniqcnt = foreach grpd {
sym = daily.symbol;
uniq_sym = distinct sym;
generate group, COUNT(uniq_sym);
};
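Nested foreach is not limited to distinct; order, limit and filter can also be used inside the block. The following is a minimal sketch of a per-group top 3, assuming a hypothetical input file 'NYSE_dividends' with the fields shown (the file, relation and field names are illustrative and do not come from the notes above).

```pig
-- hypothetical input: one dividend record per line
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, dividends:float);
grpd = group divs by symbol;
top3 = foreach grpd {
    -- inside the block, divs refers to the bag of records for the current group
    sorted = order divs by dividends desc;
    top    = limit sorted 3;
    generate group, flatten(top);
};
```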
1.3 count()
--!!-- Watch out for NULL when using count()
eg
A = LOAD 'a.txt' AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
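G = GROUP A ALL;
B = FOREACH G GENERATE COUNT(A);
B = FOREACH G GENERATE COUNT(A.col2);
When A.col2 contains nulls, COUNT(A.col2) gives a different result from COUNT(A): COUNT skips any tuple whose first field is null, so COUNT(A.col2) leaves out rows where col2 is null, while COUNT(A) leaves out only rows where col1 is null.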
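To count every row regardless of nulls, Pig provides the built-in COUNT_STAR. A short sketch that puts the two counts side by side, reusing the a.txt layout from the example above (the aliases G, cnt, non_null_col2 and all_rows are illustrative names):

```pig
A = LOAD 'a.txt' AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
-- COUNT needs a bag, so group the whole relation first
G = GROUP A ALL;
cnt = FOREACH G GENERATE
          COUNT(A.col2)  AS non_null_col2,  -- skips rows where col2 is null
          COUNT_STAR(A)  AS all_rows;       -- counts every row, nulls included
DUMP cnt;
```

If the two numbers differ, the difference is the number of rows with a null col2.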
2. join
2.1 using 'replicated'
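Use this when all inputs of the join except one are small enough to fit in memory.
With the small inputs held in memory, the join finishes in the map phase of the MapReduce job, which improves performance.
If the small inputs do not fit in memory, an error is thrown.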
eg
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
--!!-- Only inner and left outer joins are supported
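A left outer variant can be sketched as follows. The input paths and field layouts are hypothetical; note that an outer join takes exactly two relations.

```pig
-- hypothetical inputs: big is large, tiny fits in memory
big  = LOAD 'big_input'  AS (b1:chararray, b2:int);
tiny = LOAD 'tiny_input' AS (t1:chararray, t2:int);
-- the large relation is listed first; the relation after it is loaded into memory
C = JOIN big BY b1 LEFT OUTER, tiny BY t1 USING 'replicated';
```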
2.2 using 'skewed'
#TODO
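For reference, a minimal sketch of the syntax only. A skewed join is intended for inputs where a few join keys account for a disproportionate share of the records, so that a plain join would overload a few reducers. The file and field names below are hypothetical.

```pig
-- hypothetical inputs: many users live in a handful of large cities
users = LOAD 'users'    AS (name:chararray, city:chararray);
cinfo = LOAD 'cityinfo' AS (city:chararray, population:int);
jnd   = JOIN cinfo BY city, users BY city USING 'skewed';
```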
2.3 using 'merge'
#TODO
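For reference, a minimal sketch of the syntax only. A merge join applies when both inputs are already sorted on the join key, which lets Pig join them in the map phase. The file and field names below are hypothetical, and the sketch assumes both files are stored sorted by symbol.

```pig
-- assumption: both files are already sorted on symbol
daily = LOAD 'NYSE_daily_sorted'     AS (exchange:chararray, symbol:chararray, close:float);
divs  = LOAD 'NYSE_dividends_sorted' AS (exchange:chararray, symbol:chararray, dividends:float);
jnd   = JOIN daily BY symbol, divs BY symbol USING 'merge';
```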
3. COGROUP
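Groups records by key. Unlike GROUP, which works on a single relation, COGROUP groups two or more relations by their keys at the same time.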
eg
[root@localhost pig]$ cat a.txt
uidk 12 3
hfd 132 99
bbN 463 231
UFD 13 10
[root@localhost pig]$ cat b.txt
908 uidk 888
345 hfd 557
28790 re 00000
grunt> A = LOAD 'a.txt' AS (acol1:chararray, acol2:int, acol3:int);
grunt> B = LOAD 'b.txt' AS (bcol1:int, bcol2:chararray, bcol3:int);
grunt> C = COGROUP A BY acol1, B BY bcol2;
grunt> DUMP C;
(re,{},{(28790,re,0)})
(UFD,{(UFD,13,10)},{})
(bbN,{(bbN,463,231)},{})
(hfd,{(hfd,132,99)},{(345,hfd,557)})
(uidk,{(uidk,12,3)},{(908,uidk,888)})
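-- The first field of each output tuple is the group key; the second and third fields are bags holding the matching tuples from A and B respectively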
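COGROUP followed by flatten behaves like an inner join, because flattening an empty bag produces no records. A sketch using the same a.txt and b.txt as above (the alias D is an illustrative name):

```pig
A = LOAD 'a.txt' AS (acol1:chararray, acol2:int, acol3:int);
B = LOAD 'b.txt' AS (bcol1:int, bcol2:chararray, bcol3:int);
C = COGROUP A BY acol1, B BY bcol2;
-- flattening both bags gives the per-key cross product;
-- keys where either bag is empty (re, UFD, bbN) produce nothing
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
DUMP D;
-- expected tuples (order may vary):
-- (hfd,132,99,345,hfd,557)
-- (uidk,12,3,908,uidk,888)
```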
4. UNION
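The result takes its field names from the left-hand relation (A).
When the types of matching fields differ, the result is widened to the larger type; for example, float and double combine into double.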
eg
[root@localhost ~]# cat 1.txt
0 3
1 5
0 8
[root@localhost ~]# cat 2.txt
1 6
0 9
A = LOAD '1.txt' AS (a: int, b: int);
B = LOAD '2.txt' AS (c: int, d: int);
C = UNION A, B;
0 3
1 5
0 8
1 6
0 9
DESCRIBE C;
C: {a: int,b: int}
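The type-widening rule mentioned at the start of this section can be checked with DESCRIBE. A small sketch, where 'input1' and 'input2' are hypothetical files:

```pig
-- same field positions, but y is float on one side and double on the other
A = LOAD 'input1' AS (x:int, y:float);
B = LOAD 'input2' AS (x:int, y:double);
C = UNION A, B;
DESCRIBE C;
-- C: {x: int,y: double}
```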
If the two schemas differ, ONSCHEMA forces the merge by matching fields by name
eg
A = load 'input1' as (w:chararray, x:int, y:float);
B = load 'input2' as (x:int, y:double, z:chararray);
C = union onschema A, B;
describe C;
C: {w: chararray,x: int,y: double,z: chararray}
5. cross
Cross product (Cartesian product) of the inputs
Usage:
CROSS ... , ...
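A short sketch using the 1.txt and 2.txt files from the UNION section, assuming they are tab-delimited so that the default loader can split them:

```pig
A = LOAD '1.txt' AS (a:int, b:int);
B = LOAD '2.txt' AS (c:int, d:int);
-- every tuple of A is paired with every tuple of B
C = CROSS A, B;
DUMP C;
-- 3 rows x 2 rows = 6 tuples (order may vary):
-- (0,3,1,6) (0,3,0,9) (1,5,1,6) (1,5,0,9) (0,8,1,6) (0,8,0,9)
```

Because the output size is the product of the input sizes, CROSS is best kept to small inputs.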
--Pig interaction with other tools--
1. stream
Usage:
STREAM ... THROUGH ... AS ...
eg
highdivs = stream divs through `highdiv.pl` as (exchange, symbol, date, dividends);
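This passes divs through highdiv.pl.
The AS clause declares the output as (exchange, symbol, date, dividends); Pig cannot know the schema of the script's output, so the user has to specify it.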
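When the job runs on a cluster, the script also has to be present on the task nodes. One way to arrange this is DEFINE with a SHIP clause, sketched below; the input file 'NYSE_dividends' and the alias hd are assumptions made for illustration.

```pig
-- ship() sends highdiv.pl from the client machine to the cluster with the job
define hd `highdiv.pl` ship('highdiv.pl');
-- hypothetical input file
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
highdivs = stream divs through hd as (exchange, symbol, date, dividends);
```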
2. mapreduce
Runs a native MapReduce job from within Pig.
Useful for a step that has to be part of a Pig pipeline but is better suited to MapReduce.
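A minimal sketch of the MAPREDUCE operator. Pig stores the input relation to a location, runs the jar, then loads the job's output back as a relation. The jar name, the class org.myorg.WordCount, and all paths are hypothetical.

```pig
A = LOAD 'WordcountInput.txt';
-- STORE writes A where the job reads it; LOAD reads what the job wrote
B = MAPREDUCE 'wordcount.jar'
        STORE A INTO 'inputDir'
        LOAD 'outputDir' AS (word:chararray, count:int)
        `org.myorg.WordCount inputDir outputDir`;
```

The backquoted string holds the arguments passed to the jar, here the main class followed by its input and output directories.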