hive3.1.x教程:案例-微博数据分析 作者:马育民 • 2025-12-08 08:05 • 阅读:10003 # 准备数据 详见 [链接](https://www.malaoshi.top/show_1GW2MhArYzwZ.html "链接") ### 上传 创建目录: ``` mkdir /weibo_data -p ``` 上传到 `hadoop1` 的 `/weibo_data/` 目录 # 启动服务 登录 `hadoop1` 服务器 ``` cd /program/bin ``` ### 启动hadoop ``` ./hadoop.sh start ``` ### 启动hive服务 ``` ./hive.sh start ``` # 启动hive客户端 登录 `hadoop1` 服务器,执行 `hive` 命令启动客户端 # hive建表 创建外部表 ``` CREATE EXTERNAL TABLE if not exists `weibo`(`json` string)COMMENT "This is the quova ASN source json table" LOCATION "/weibo" ``` # 上传数据到 hive ``` hadoop fs -put /weibo_data/* /weibo/ ``` # 统计分析1 ### 统计微博总量 ``` select count(*) from weibo; ``` ### 独立用户数 ``` select count(distinct(get_json_object(a.j,"$.userId"))) from (select substring(json,2,length(json)-2) as j from weibo) a; ``` ### 统计用户所有微博被转发的总次数,并输出TOP-3 ``` select b.id,sum(b.cnt) as bsum from(select get_json_object(a.j,"$.userId") as id,get_json_object(a.j,"$.reportCount") as cnt from(select substring(json,2,length(json)-2) as j from weibo) a) b group by b.id order by bsum desc limit 3; ``` ### 统计被转发次数最多的TOP-3微博,输出用户id ``` select get_json_object(a.j,"$.userId") as id, cast(get_json_object(a.j,"$.reportCount") AS INT) as cnt from (select substring(json,2,length(json)-2) as j from weibo) a order by cnt desc limit 3; ``` ### 统计每个用户的发送微博总数,并存储到临时表 ``` create table weibo_uid_wbcnt(userid STRING, wbcnt INT) row format delimited fields terminated by "\t"; ``` ### 统计每个用户的发送微博总数放到临时表中 ``` insert overwrite table weibo_uid_wbcnt select get_json_object(a.j,"$.userId"),count(1) from (select substring(json,2,length(json)-2) as j from weibo) a group by get_json_object(a.j,"$.userId") ; ``` ### 统计带图片的微博数 ``` select count(*) from (select substring(json,2,length(json)-2) as j from weibo) a where get_json_object(a.j, "$.pic_list") like "%http%"; ``` ### 统计使用iphone发微博的独立用户数 ``` select count(distinct get_json_object(a.j,"$.userId")) from (select substring(json,2,length(json)-2) as j from weibo) a where lower(get_json_object(a.j,"$.source")) like "%iphone%"; ``` ### 统计微博中评论次数 `<1000` 的用户ID与数据来源信息,将其放入视图中 ``` create view weibo_view as select get_json_object(a.j, "$.userId") as uid, get_json_object(a.j, "$.source") as source from (select substring(json, 2, length(json) - 2) as j from weibo) a; ``` ### 统计视图中数据来源是“iPad客户端”的用户数目 ``` select count(distinct(uid)) from weibo_view where source="iPad客户端"; ``` # 统计分析2 ### 编写 hive UDF 函数 详见链接: https://www.malaoshi.top/show_1GW2MhjD8got.html ### 实现将微博的点赞人数与转发人数相加求和,并将相加之和降序排列,取前10条记录 向hive中添加自定义jar: ``` add jar /azkaban/weibo_sh/original-hive_weibo-1.0-SNAPSHOT.jar; ``` 创建hive临时函数: ``` create temporary function wb as 'top.malaoshi.weibo.udf.Sum'; ``` 查询微博的点赞人数与转发人数相加求和,并将相加之和降序排列,取前10条记录: ``` select wb(cast(get_json_object(a.j,'$.praiseCount') as int),cast(get_json_object(a.j,'$.reportCount')as int)) as cnt from (select substring(json,2,length(json)-2) as j from weibo) a order by cnt desc limit 10; ``` ### 统计微博内容中出现 **“iphone”** 次数最多的用户,最终结果输出用户ID和次数前10行 向hive中添加自定义jar: ``` add jar /azkaban/weibo_sh/original-hive_weibo-1.0-SNAPSHOT.jar; ``` 创建hive临时函数: ``` create temporary function wcount as 'top.malaoshi.weibo.udf.Count'; ``` 统计微博内容中出现“北京”次数最多的用户,最终结果输出用户ID和次数,取前10行: ``` select b.id,sum(b.cnt) as cn from (select get_json_object(a.j,'$.userId')as id, wcount(get_json_object(a.j,'$.content'),'北京') as cnt from (select substr(json,2,length(json)-2) as j from weibo) a)b group by b.id order by cn desc limit 10; ``` 原文出处:http://www.malaoshi.top/show_1GW2Mr1A0tF8.html