Questions to guide your reading:
1. How should the ODS layer be designed?
2. How does the DWD layer null-filter the ODS data?
3. How is the DWS-layer user behavior wide table designed?
4. How is the user behavior wide table loaded by script?
Previous post: Big Data Project: E-commerce Data Warehouse (Summary) (II): System Business Data Warehouse
3.3 The ODS Layer
Create ODS tables that mirror the tables in the source business database, field for field.
3.3.1 Create the Order Table
hive (gmall)>
drop table if exists ods_order_info;
create external table ods_order_info (
    `id` string COMMENT 'order id',
    `total_amount` decimal(10,2) COMMENT 'order amount',
    `order_status` string COMMENT 'order status',
    `user_id` string COMMENT 'user id',
    `payment_way` string COMMENT 'payment method',
    `out_trade_no` string COMMENT 'payment transaction number',
    `create_time` string COMMENT 'creation time',
    `operate_time` string COMMENT 'operation time'
) COMMENT 'order table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_order_info/';
3.3.2 Create the Order Detail Table
hive (gmall)>
drop table if exists ods_order_detail;
create external table ods_order_detail(
    `id` string COMMENT 'order detail id',
    `order_id` string COMMENT 'order id',
    `user_id` string COMMENT 'user id',
    `sku_id` string COMMENT 'sku id',
    `sku_name` string COMMENT 'product name',
    `order_price` string COMMENT 'unit price',
    `sku_num` string COMMENT 'quantity',
    `create_time` string COMMENT 'creation time'
) COMMENT 'order detail table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_order_detail/';
3.3.3 Create the Product (SKU) Table
hive (gmall)>
drop table if exists ods_sku_info;
create external table ods_sku_info(
    `id` string COMMENT 'sku id',
    `spu_id` string COMMENT 'spu id',
    `price` decimal(10,2) COMMENT 'price',
    `sku_name` string COMMENT 'product name',
    `sku_desc` string COMMENT 'product description',
    `weight` string COMMENT 'weight',
    `tm_id` string COMMENT 'brand id',
    `category3_id` string COMMENT 'level-3 category id',
    `create_time` string COMMENT 'creation time'
) COMMENT 'product table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_sku_info/';
3.3.4 Create the User Table
hive (gmall)>
drop table if exists ods_user_info;
create external table ods_user_info(
    `id` string COMMENT 'user id',
    `name` string COMMENT 'name',
    `birthday` string COMMENT 'birthday',
    `gender` string COMMENT 'gender',
    `email` string COMMENT 'email',
    `user_level` string COMMENT 'user level',
    `create_time` string COMMENT 'creation time'
) COMMENT 'user info'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_user_info/';
3.3.5 Create the Level-1 Category Table
hive (gmall)>
drop table if exists ods_base_category1;
create external table ods_base_category1(
    `id` string COMMENT 'id',
    `name` string COMMENT 'name'
) COMMENT 'level-1 product category'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_base_category1/';
3.3.6 Create the Level-2 Category Table
hive (gmall)>
drop table if exists ods_base_category2;
create external table ods_base_category2(
    `id` string COMMENT 'id',
    `name` string COMMENT 'name',
    `category1_id` string COMMENT 'level-1 category id'
) COMMENT 'level-2 product category'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_base_category2/';
3.3.7 Create the Level-3 Category Table
hive (gmall)>
drop table if exists ods_base_category3;
create external table ods_base_category3(
    `id` string COMMENT 'id',
    `name` string COMMENT 'name',
    `category2_id` string COMMENT 'level-2 category id'
) COMMENT 'level-3 product category'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_base_category3/';
3.3.8 Create the Payment Table
hive (gmall)>
drop table if exists ods_payment_info;
create external table ods_payment_info(
    `id` bigint COMMENT 'id',
    `out_trade_no` string COMMENT 'external trade number',
    `order_id` string COMMENT 'order id',
    `user_id` string COMMENT 'user id',
    `alipay_trade_no` string COMMENT 'Alipay trade number',
    `total_amount` decimal(16,2) COMMENT 'payment amount',
    `subject` string COMMENT 'transaction subject',
    `payment_type` string COMMENT 'payment type',
    `payment_time` string COMMENT 'payment time'
) COMMENT 'payment table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_payment_info/';
3.3.9 ODS Layer Load Script
1) Create the script ods_db.sh under /home/kgg/bin
[kgg@hadoop102 bin]$ vim ods_db.sh
Add the following content to the script:
#!/bin/bash

APP=gmall
hive=/opt/module/hive/bin/hive

# Use the date passed as the first argument; otherwise default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table "$APP".ods_order_info partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table "$APP".ods_order_detail partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table "$APP".ods_sku_info partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table "$APP".ods_user_info partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table "$APP".ods_payment_info partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table "$APP".ods_base_category1 partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table "$APP".ods_base_category2 partition(dt='$do_date');

load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table "$APP".ods_base_category3 partition(dt='$do_date');
"
$hive -e "$sql"
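The date-defaulting logic at the top of the script can be exercised on its own. A minimal sketch of the same idiom, wrapped in a hypothetical helper function (assumes GNU date, as on the hadoop102 host):

```shell
#!/bin/bash
# Resolve the partition date the same way ods_db.sh does:
# use the first argument if present, otherwise yesterday (GNU date).
resolve_date() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        date -d "-1 day" +%F
    fi
}

resolve_date 2019-02-10   # prints 2019-02-10
resolve_date              # prints yesterday as YYYY-MM-DD
```

Defaulting to yesterday is what lets the script run unattended from a nightly scheduler, while the optional argument supports manual reruns.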
2) Make the script executable
[kgg@hadoop102 bin]$ chmod 777 ods_db.sh
3) Run the script to load the data
[kgg@hadoop102 bin]$ ods_db.sh 2019-02-10
[kgg@hadoop102 bin]$ ods_db.sh 2019-02-11
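Backfilling a longer run of days just means invoking the script once per date. A small bash sketch of such a loop (the actual `ods_db.sh` call is commented out so the date arithmetic can be read on its own; assumes GNU date):

```shell
#!/bin/bash
# Walk an inclusive date range one day at a time (GNU date arithmetic)
# and run the loader once per day.
start=2019-02-10
end=2019-02-12
d="$start"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
    echo "loading $d"
    # ods_db.sh "$d"    # the real load call would go here
    d=$(date -d "$d +1 day" +%F)
done
```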
4) Query the loaded data
hive (gmall)>
select * from ods_order_info where dt='2019-02-10' limit 1;
select * from ods_order_info where dt='2019-02-11' limit 1;
3.4 The DWD Layer
Filter out ODS records whose primary key is null, and degenerate (flatten) the product category dimension into the product table.
3.4.1 Create the Order Table
hive (gmall)>
drop table if exists dwd_order_info;
create external table dwd_order_info (
    `id` string COMMENT 'order id',
    `total_amount` decimal(10,2) COMMENT 'order amount',
    `order_status` string COMMENT 'order status (1/2/3/4/5)',
    `user_id` string COMMENT 'user id',
    `payment_way` string COMMENT 'payment method',
    `out_trade_no` string COMMENT 'payment transaction number',
    `create_time` string COMMENT 'creation time',
    `operate_time` string COMMENT 'operation time'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_order_info/'
tblproperties ("parquet.compression"="snappy");
3.4.2 Create the Order Detail Table
hive (gmall)>
drop table if exists dwd_order_detail;
create external table dwd_order_detail(
    `id` string COMMENT 'order detail id',
    `order_id` string COMMENT 'order id',
    `user_id` string COMMENT 'user id',
    `sku_id` string COMMENT 'sku id',
    `sku_name` string COMMENT 'product name',
    `order_price` string COMMENT 'unit price',
    `sku_num` string COMMENT 'quantity',
    `create_time` string COMMENT 'creation time'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_order_detail/'
tblproperties ("parquet.compression"="snappy");
3.4.3 Create the User Table
hive (gmall)>
drop table if exists dwd_user_info;
create external table dwd_user_info(
    `id` string COMMENT 'user id',
    `name` string COMMENT 'name',
    `birthday` string COMMENT 'birthday',
    `gender` string COMMENT 'gender',
    `email` string COMMENT 'email',
    `user_level` string COMMENT 'user level',
    `create_time` string COMMENT 'creation time'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_user_info/'
tblproperties ("parquet.compression"="snappy");
3.4.4 Create the Payment Table
hive (gmall)>
drop table if exists dwd_payment_info;
create external table dwd_payment_info(
    `id` bigint COMMENT 'id',
    `out_trade_no` string COMMENT 'external trade number',
    `order_id` string COMMENT 'order id',
    `user_id` string COMMENT 'user id',
    `alipay_trade_no` string COMMENT 'Alipay trade number',
    `total_amount` decimal(16,2) COMMENT 'payment amount',
    `subject` string COMMENT 'transaction subject',
    `payment_type` string COMMENT 'payment type',
    `payment_time` string COMMENT 'payment time'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_payment_info/'
tblproperties ("parquet.compression"="snappy");
3.4.5 Create the Product Table (with category columns added)
hive (gmall)>
drop table if exists dwd_sku_info;
create external table dwd_sku_info(
    `id` string COMMENT 'sku id',
    `spu_id` string COMMENT 'spu id',
    `price` decimal(10,2) COMMENT 'price',
    `sku_name` string COMMENT 'product name',
    `sku_desc` string COMMENT 'product description',
    `weight` string COMMENT 'weight',
    `tm_id` string COMMENT 'brand id',
    `category3_id` string COMMENT 'level-3 category id',
    `category2_id` string COMMENT 'level-2 category id',
    `category1_id` string COMMENT 'level-1 category id',
    `category3_name` string COMMENT 'level-3 category name',
    `category2_name` string COMMENT 'level-2 category name',
    `category1_name` string COMMENT 'level-1 category name',
    `create_time` string COMMENT 'creation time'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_sku_info/'
tblproperties ("parquet.compression"="snappy");
3.4.6 DWD Layer Load Script
1) Create the script dwd_db.sh under /home/kgg/bin
[kgg@hadoop102 bin]$ vim dwd_db.sh
Add the following content to the script:
#!/bin/bash

# Variables defined up front for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# Use the date passed as the first argument; otherwise default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
-- allow the partition value to be taken from the inserted rows
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_sku_info partition(dt)
select
    sku.id,
    sku.spu_id,
    sku.price,
    sku.sku_name,
    sku.sku_desc,
    sku.weight,
    sku.tm_id,
    sku.category3_id,
    c2.id category2_id,
    c1.id category1_id,
    c3.name category3_name,
    c2.name category2_name,
    c1.name category1_name,
    sku.create_time,
    sku.dt
from
    "$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id
join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id
join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id
where sku.dt='$do_date' and c2.dt='$do_date'
and c3.dt='$do_date' and c1.dt='$do_date'
and sku.id is not null;
"

$hive -e "$sql"
2) Make the script executable
[kgg@hadoop102 bin]$ chmod 777 dwd_db.sh
3) Run the script to load the data
[kgg@hadoop102 bin]$ dwd_db.sh 2019-02-10
[kgg@hadoop102 bin]$ dwd_db.sh 2019-02-11
4) Check the loaded data
hive (gmall)>
select * from dwd_sku_info where dt='2019-02-10' limit 2;
select * from dwd_sku_info where dt='2019-02-11' limit 2;
3.4.7 Summary
Food for thought:
1) What does dimension degeneration cost?
If other business tables still rely on the degenerated dimension, they become more awkward to handle once it has been folded away.
2) Which other dimension tables in a real business could be degenerated?
The three-level region hierarchy (province, city, county), for example.
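As a sketch of that answer, a snowflaked province/city/county hierarchy could be degenerated the same way the category tables were folded into dwd_sku_info. The table and column names below (dwd_dim_region, ods_base_county, and so on) are hypothetical and not part of this project; following the document's own script style, the SQL is built in a shell variable, and here it is only printed rather than handed to hive:

```shell
#!/bin/bash
# Hypothetical example: flatten a county -> city -> province snowflake
# into one flat region dimension, mirroring the dwd_sku_info join pattern.
APP=gmall

sql="
insert overwrite table ${APP}.dwd_dim_region
select
    county.id,
    county.name county_name,
    city.name city_name,
    prov.name province_name
from ${APP}.ods_base_county county
join ${APP}.ods_base_city city on county.city_id = city.id
join ${APP}.ods_base_province prov on city.province_id = prov.id;
"

# A real script would run: $hive -e "$sql"; here we just show the statement.
echo "$sql"
```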
3.5 DWS Layer: The User Behavior Wide Table
1) Why build a wide table?
The goal is to aggregate each user's behavior for a single day into one wide, multi-column row; after joining in user dimension attributes, the table can then be analyzed from many different angles.
3.5.1 Create the User Behavior Wide Table
hive (gmall)>
drop table if exists dws_user_action;
create external table dws_user_action
(
    user_id string comment 'user id',
    order_count bigint comment 'number of orders',
    order_amount decimal(16,2) comment 'order amount',
    payment_count bigint comment 'number of payments',
    payment_amount decimal(16,2) comment 'payment amount',
    comment_count bigint comment 'number of comments'
) COMMENT 'daily user behavior wide table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action/';
3.5.2 Load the User Behavior Wide Table
1) Load the data
hive (gmall)>
with
tmp_order as
(
    select
        user_id,
        count(*) order_count,
        sum(oi.total_amount) order_amount
    from dwd_order_info oi
    where date_format(oi.create_time,'yyyy-MM-dd')='2019-02-10'
    group by user_id
),
tmp_payment as
(
    select
        user_id,
        sum(pi.total_amount) payment_amount,
        count(*) payment_count
    from dwd_payment_info pi
    where date_format(pi.payment_time,'yyyy-MM-dd')='2019-02-10'
    group by user_id
),
tmp_comment as
(
    select
        user_id,
        count(*) comment_count
    from dwd_comment_log c
    where date_format(c.dt,'yyyy-MM-dd')='2019-02-10'
    group by user_id
)

insert overwrite table dws_user_action partition(dt='2019-02-10')
select
    user_actions.user_id,
    sum(user_actions.order_count),
    sum(user_actions.order_amount),
    sum(user_actions.payment_count),
    sum(user_actions.payment_amount),
    sum(user_actions.comment_count)
from
(
    select
        user_id,
        order_count,
        order_amount,
        0 payment_count,
        0 payment_amount,
        0 comment_count
    from tmp_order

    union all
    select
        user_id,
        0,
        0,
        payment_count,
        payment_amount,
        0
    from tmp_payment

    union all
    select
        user_id,
        0,
        0,
        0,
        0,
        comment_count
    from tmp_comment
) user_actions
group by user_id;
2) Query the result
hive (gmall)> select * from dws_user_action;
3.5.3 User Behavior Wide Table Load Script
1) Create the script dws_db_wide.sh under /home/kgg/bin
[kgg@hadoop102 bin]$ vim dws_db_wide.sh
Add the following content to the script:
#!/bin/bash

# Variables defined up front for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# Use the date passed as the first argument; otherwise default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
with
tmp_order as
(
    select
        user_id,
        sum(oi.total_amount) order_amount,
        count(*) order_count
    from "$APP".dwd_order_info oi
    where date_format(oi.create_time,'yyyy-MM-dd')='$do_date'
    group by user_id
),
tmp_payment as
(
    select
        user_id,
        sum(pi.total_amount) payment_amount,
        count(*) payment_count
    from "$APP".dwd_payment_info pi
    where date_format(pi.payment_time,'yyyy-MM-dd')='$do_date'
    group by user_id
),
tmp_comment as
(
    select
        user_id,
        count(*) comment_count
    from "$APP".dwd_comment_log c
    where date_format(c.dt,'yyyy-MM-dd')='$do_date'
    group by user_id
)

insert overwrite table "$APP".dws_user_action partition(dt='$do_date')
select
    user_actions.user_id,
    sum(user_actions.order_count),
    sum(user_actions.order_amount),
    sum(user_actions.payment_count),
    sum(user_actions.payment_amount),
    sum(user_actions.comment_count)
from
(
    select
        user_id,
        order_count,
        order_amount,
        0 payment_count,
        0 payment_amount,
        0 comment_count
    from tmp_order

    union all
    select
        user_id,
        0,
        0,
        payment_count,
        payment_amount,
        0
    from tmp_payment

    union all
    select
        user_id,
        0,
        0,
        0,
        0,
        comment_count
    from tmp_comment
) user_actions
group by user_id;
"

$hive -e "$sql"
2) Make the script executable
[kgg@hadoop102 bin]$ chmod 777 dws_db_wide.sh
3) Run the script to load the data
[kgg@hadoop102 bin]$ dws_db_wide.sh 2019-02-11
4) Check the loaded data
hive (gmall)>
select * from dws_user_action where dt='2019-02-11' limit 2;