Hive study exercises -- from a total beginner
1. Create a database
create database mydb;
2. List databases
show databases;
3. Drop a database
Syntax: drop (database|schema) [if exists] database_name [restrict|cascade];
drop database mydb;
The following uses cascade to drop a database, meaning all of its tables are removed before the database itself is dropped:
hive> drop database if exists userdb cascade;
DROP DATABASE IF EXISTS movie CASCADE;
The following drops a database using the schema keyword:
hive> drop schema userdb;
4. Switch to the database we want to work in
use mydb;
5. List the tables in the database
show tables;
6. Create tables
Create a managed (internal) table (Managed_Table):
create table mingxing(id int, name string, sex string, age int, department string)
row format delimited fields terminated by ',';
Create an external table (External_Table):
create external table mingxing(id int, name string, sex string, age int, department string)
row format delimited fields terminated by ',' location '/root/hivedata';
Note: the location specified when creating an external table must be a directory, not a single file.
Create a partitioned table:
create table mingxing(id int, name string, sex string, age int, department string)
partitioned by (city string)
row format delimited fields terminated by ',';
Note: the partition column must not be one of the columns already declared in the table.
Create a bucketed table (loading sketch after this block):
create table mingxing(id int, name string, sex string, age int, department string)
clustered by(id) sorted by(age desc) into 4 buckets
row format delimited fields terminated by ',';
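Also worth knowing: load data does not bucket rows by itself, so a bucketed table is normally filled through insert ... select. A minimal sketch, where mingxing_stage is a hypothetical unbucketed staging table holding the raw rows:
set hive.enforce.bucketing = true;  -- needed on Hive 1.x; bucketing is always enforced from Hive 2.x on
insert into table mingxing
select id,name,sex,age,department from mingxing_stage;  -- mingxing_stage is hypothetical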
7. Drop a table
drop table [if exists] mingxing;
8. Rename a table
alter table mingxing rename to student;
9. Column operations (add, drop, change, replace)
Add columns:
alter table mingxing add columns (province string);
alter table mingxing add columns (province string, xxx bigint);
Drop a column:
drop (not supported; remove columns by listing only the remaining ones in replace columns instead)
Change a column:
alter table mingxing change age newage string;
Replace columns:
alter table mingxing replace columns(id int, name string, sex string);
10. Partition operations
Add partitions:
alter table mingxing add partition(city='beijing');
alter table mingxing add partition(city='beijing') partition(city='tianjin');
Drop a partition:
alter table mingxing drop partition(city='beijing');
11. Show and describe commands
List databases: show databases;
List tables: show tables;
List built-in functions: show functions;
List partitions: show partitions mingxing;
Show a table's columns: desc mingxing;
Show a table's details: desc extended mingxing;
Show a table's formatted details: desc formatted mingxing;
Show extended information about a function: desc function extended concat;
12. Import data with load
Load data from a local relative path:
load data local inpath './student.txt' into table mingxing;
load data local inpath './student.txt' overwrite into table mingxing;
(overwrite load)
Load data from a local absolute path:
load data local inpath '/root/hivedata/student.txt' into table mingxing;
Load data from HDFS using a short path:
load data inpath '/root/hivedata/student.txt' into table mingxing;
Load data from HDFS using a fully qualified path:
load data inpath 'hdfs://hadoop01:9000/root/hivedata/student.txt' into table mingxing;
13. Insert data into tables with the insert keyword
Insert a single row:
insert into table mingxing values(001,'huangbo','male',50,'MA');
Single insert (insert from a query):
insert into table student select id,name,sex,age,department from mingxing;
Note: the selected columns must be columns that exist in the student table.
Multi-insert (one table scan feeding several inserts):
from mingxing
insert into table student1 select id,name,sex,age
insert into table student2 select id,department;
from mingxing2
insert into table student1 partition(department='MA') select id,name,sex ,age where department='MA'
insert into table student1 partition(department='CS') select id,name,sex ,age where department='CS';
Static partition insert:
load data local inpath '/root/hivedata/student.txt' into table student partition(city='henan');
Dynamic partition insert (settings sketch after this block):
create table student(name string, department string) partitioned by (id int) .....
insert into table student partition(id) select name,department,id from mingxing2;
student table columns: name, department; the partition column is id.
Selected columns: name, department, id (the partition column).
Note: in a dynamic partition insert, the partition column must be the last one in the select list.
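Note that Hive's default strict mode rejects a fully dynamic partition insert like the one above, so these two standard configuration properties usually have to be enabled first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;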
CTAS (create table ... as select ...) (stores the query result directly in a newly created table):
create table student as select id,name,age,department from mingxing;
Note: the names, types, and comments of the new table's columns exactly match the columns in the select list.
14. The like keyword [copies only the table's structure, not its data]
create table student like mingxing;
15. Export data to the local filesystem or HDFS with insert
Single export to the local filesystem:
insert overwrite local directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Multi-export to the local filesystem:
from mingxing
insert overwrite local directory '/root/outputdata1' select id, name
insert overwrite local directory '/root/outputdata2' select id, name,age;
Export to HDFS with a short path:
insert overwrite directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Export to HDFS with a fully qualified path (delimiter note after this block):
insert overwrite directory 'hdfs://hadoop01:9000/root/outputdata1' select id,name,sex,age,department from mingxing;
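By default these directory exports separate fields with the \001 control character. Since Hive 0.11 the export statement itself can specify a readable delimiter; a minimal sketch:
insert overwrite local directory '/root/outputdata'
row format delimited fields terminated by ','
select id,name,sex,age,department from mingxing;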
16. Truncate a table's data
truncate table mingxing2;
17. Query data
Basic query:
select * from mingxing join student where ... group by ... order by ... limit ...
Globally ordered query:
select * from mingxing order by age desc , id asc;
If the data volume is too large, sort within each reducer instead (partial sort):
set mapred.reduce.tasks=3;
set mapreduce.job.reduces=3;
select * from mingxing sort by id asc;
Bucketed query:
set hive.enforce.bucketing = true;
select * from mingxing distribute by sex;
Query sorted, distributed data (cluster by cannot be combined with a separate sort by, so use distribute by when the sort order differs):
select * from mingxing distribute by id sort by id desc, age asc;
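For reference, cluster by id is simply shorthand for distributing and sorting (ascending) on the same column, so these two queries behave identically:
select * from mingxing cluster by id;
select * from mingxing distribute by id sort by id;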
18. Five kinds of join queries
Inner join:
select student.*, mingxing.* from student join mingxing on student.id = mingxing.id
select student.*, mingxing.* from student inner join mingxing on student.id = mingxing.id
Left outer join:
select student.*, mingxing.* from student left join mingxing on student.id = mingxing.id
select student.*, mingxing.* from student left outer join mingxing on student.id = mingxing.id
Right outer join:
select student.*, mingxing.* from student right join mingxing on student.id = mingxing.id
select student.*, mingxing.* from student right outer join mingxing on student.id = mingxing.id
Full outer join:
select student.*, mingxing.* from student full join mingxing on student.id = mingxing.id
select student.*, mingxing.* from student full outer join mingxing on student.id = mingxing.id
left semi join, Hive's efficient implementation of in/exists (columns of the right-hand table cannot appear in the select list):
select student.* from student left semi join mingxing on student.id = mingxing.id;
Equivalent to:
select student.* from student where student.id in(select distinct id from mingxing);
19. Complex data types:
create table student(
name string,
favoter array<string>,
scoress map<string,float>,
address struct<provive:string,city:int>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':';
collection items terminated by sets the element separator (between the elements of an array, the fields of a struct, and the key-value pairs of a map); map keys terminated by sets the separator between a map's key and its value; lines terminated by sets the row separator; stored as textfile sets the storage format of the uploaded data file. The three delimiters must all differ, or the fields cannot be parsed apart.
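To make the delimiters concrete, here is a minimal sketch of one matching data line and a query over each complex field (the sample values are invented):
-- a sample line of the text file: huangbo,music_film,math:90.0_english:85.5,henan_1
select name, favoter[0], scoress['math'], address.provive, address.city from student;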
20. Hive command-line usage (invocation sketches follow the list)
hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]
1. -i: initialize the session with HQL from a file.
2. -e: execute the given HQL string from the command line.
3. -f: execute an HQL script file.
4. -v: echo the executed HQL statements to the console.
5. -p <port>: connect to the Hive Server on the given port.
6. -hiveconf x=y: set hive/hadoop configuration variables.
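A few invocation sketches using only the flags listed above (file and table names are made up):
hive -e 'select * from mydb.mingxing limit 10;'
hive -f /root/hivedata/query.hql
hive -S -hiveconf mapreduce.job.reduces=3 -f /root/hivedata/query.hql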
21. Views
create view vname as select * from student;
desc vname;
drop view vname;
-------------------------------------------------------------------
Special function examples:
JSON parsing function (multi-field sketch after this statement):
insert into table rate select get_json_object(line,'$.movie') as movie from rat_json;
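When several fields are needed from the same JSON column, the standard json_tuple UDTF parses each line once instead of once per field; a sketch assuming the rat_json table above also carries a rate field in its JSON:
select t.movie, t.rate
from rat_json lateral view json_tuple(line, 'movie', 'rate') t as movie, rate;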
Regular expression serde: handling special delimiters
create table tname(id string, name string) row format serde
'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)\\|\\|(.*)','output.format.string'='%1$s %2$s');
Handling data skew
Data skew caused by null values:
select * from log a join user b on a.user_id is not null and a.user_id=b.user_id union all select * from log c where c.user_id is null;
Or give null values a new random key:
select * from log a left outer join user b on case when a.user_id is null then concat('yxy',rand()) else a.user_id end = b.user_id;
Data skew caused by joining columns of different data types:
Cast the join column to a common type:
select * from user a left outer join log b on a.user_id=cast(b.user_id as string);
Joining a large table with a small table (map join; sketch below):
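Hive can load the small table into memory and join on the map side, skipping the shuffle entirely. Both the auto-convert setting and the MAPJOIN hint are standard Hive features; a minimal sketch reusing the log and user tables above:
set hive.auto.convert.join=true;  -- let Hive turn qualifying joins into map joins automatically
select /*+ MAPJOIN(b) */ a.* from log a join user b on a.user_id = b.user_id;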
Optimization tips:
Design a good data model
Resolve data skew
Reduce the number of jobs
Set a reasonable number of map and reduce tasks
Understand the data distribution
Be careful with count(distinct), group by, and other skew-prone operations (see the sketch after this list)
Merge small files
Optimizing the whole job matters more than optimizing any single step
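As an illustration of the count(distinct) item, the usual skew workaround is to deduplicate with group by in a subquery so the work spreads across reducers; a minimal sketch against the log table used earlier:
-- skew-prone: all distinct keys funnel into a single reducer
select count(distinct user_id) from log;
-- spread out: group by deduplicates in parallel, then a cheap count
select count(1) from (select user_id from log group by user_id) t;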