Implementing a wordCount program with Hive

Today I ran a simple test of my own: a wordCount on Hive. Posted here as a reference for beginners.

a. Create a database and switch to it, for example:
create database word;
use word;

b. Create the table
create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';

This assumes our data is stored in Hadoop (on HDFS) under the path /home/hadoop/worddata, which holds a few text files of words with content roughly like this:

hello man
what are you doing now
my running
hello
kevin
hi man

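Running the HQL above creates a table named word_data whose rows are the lines of those files, with each line stored in the column line. Running select * from word_data; shows the data.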

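c. Following the MapReduce approach, we need a split step that breaks each line into words. This uses one of Hive's built-in table-generating functions (UDTF), explode(array): it takes an array and emits one row per element, so a single input row becomes multiple output rows: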

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;

Contents of the words table (select * from words;):
OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man

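split is the splitting function and works like Java's split (the delimiter is treated as a regular expression). Here it splits on a space, so once the HQL statement has run, the words table holds nothing but individual words.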
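One caveat with splitting on a single space: lines containing consecutive spaces, tabs, or leading whitespace would produce empty strings that get counted as words. The following is a minimal sketch, not part of the original test, that splits on runs of whitespace and drops empty tokens. It assumes the word_data table from step b; the alias t and the empty-string filter are illustrative additions.

```sql
-- Sketch: split on one or more whitespace characters instead of a single space.
-- Assumes the word_data table from step b; the alias t is hypothetical.
select t.word
from word_data
lateral view explode(split(line, '\\s+')) t as word
where t.word != '';  -- drop empty tokens caused by leading whitespace
```

The result of this query could be inserted into the words table in place of the plain explode(split(line, " ")) query if the input files are not cleanly single-spaced.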

d. That is basically it. Since HQL supports group by, the final counting statement is:

select word, count(*) from word.words group by word;
Note: word.words means database_name.table_name, and the word in group by word is the word string column defined by create table words(word string).
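The split step and the count step can also be combined into one query, which avoids the intermediate words table. This is a sketch under the assumption that word_data lives in the word database as created above; the aliases t and cnt are names introduced here for illustration.

```sql
-- Sketch: word count in a single query, without the intermediate words table.
-- Assumes word.word_data from step b; aliases t and cnt are hypothetical.
select t.word, count(*) as cnt
from word.word_data
lateral view explode(split(line, ' ')) t as word
group by t.word;
```

The two-step version with a words table is easier to inspect while learning; the single query is handy once the intermediate rows no longer need to be looked at.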

Result:
are      1
doing    1
hello    2
hi       1
kevin    1
man      2
my       1
now      1
running  1
what     1
you      1

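Summary: compared with writing MapReduce code by hand, Hive is the simpler option. For more complex statistics you can create intermediate tables, or views, to break the work into steps.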
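As an illustration of the view idea, here is a minimal sketch that replaces the words table with a view, so the split is recomputed from word_data on each query rather than stored. The view name words_view and the aliases t and cnt are hypothetical; they do not appear in the original test.

```sql
-- Sketch: use a view instead of an intermediate table.
-- words_view, t and cnt are hypothetical names chosen for this example.
create view if not exists word.words_view as
select t.word
from word.word_data
lateral view explode(split(line, ' ')) t as word;

-- Count words through the view.
select word, count(*) as cnt
from word.words_view
group by word;
```

A view always reflects the current files under the table's location, while an intermediate table holds a snapshot that has to be reloaded when the data changes.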

Reposted from: http://blog.chinaunix.net/uid-25691489-id-5125057.html
