hive3.1.x教程：案例-微博数据分析-马育民老师

# 准备数据

详见 [链接](https://www.malaoshi.top/show_1GW2MhArYzwZ.html "链接")

### 上传

创建目录：

```
mkdir /weibo_data -p
```

上传到 `hadoop1` 的 `/weibo_data/` 目录

# 启动服务

```
cd /program/bin
```

### 启动hadoop

```
./hadoop.sh start
```

### 启动hive服务

```
./hive.sh start
```

# 启动hive客户端

# hive建表

创建外部表
```
CREATE EXTERNAL TABLE if not exists `weibo`(`json` string)COMMENT "This is the quova ASN source json table" LOCATION "/weibo"
```

# 上传数据到 hive

```
hadoop fs -put /weibo_data/* /weibo/
```

# 统计分析1

### 统计微博总量

```
select count(*) from weibo;
```

### 独立用户数

```
select count(distinct(get_json_object(a.j,"$.userId"))) from (select substring(json,2,length(json)-2) as j from weibo) a;
```

### 统计用户所有微博被转发的总次数，并输出TOP-3

```
select b.id,sum(b.cnt) as bsum from(select get_json_object(a.j,"$.userId") as id,get_json_object(a.j,"$.reportCount") as cnt from(select substring(json,2,length(json)-2) as j from weibo) a) b group by b.id order by bsum desc limit 3;
```

### 统计被转发次数最多的TOP-3微博，输出用户id

```
select get_json_object(a.j,"$.userId") as id, cast(get_json_object(a.j,"$.reportCount") AS INT) as cnt from (select substring(json,2,length(json)-2) as j from weibo) a order by cnt desc limit 3;
```

### 统计每个用户的发送微博总数，并存储到临时表

```
create table weibo_uid_wbcnt(userid STRING, wbcnt INT) row format delimited fields terminated  by "\t";
```

### 统计每个用户的发送微博总数放到临时表中

```
insert overwrite table weibo_uid_wbcnt select get_json_object(a.j,"$.userId"),count(1) from (select substring(json,2,length(json)-2) as j from weibo) a group by get_json_object(a.j,"$.userId") ;
```

### 统计带图片的微博数

```
select count(*) from (select substring(json,2,length(json)-2) as j from weibo) a where get_json_object(a.j, "$.pic_list") like "%http%";
```

### 统计使用iphone发微博的独立用户数

```
select count(distinct get_json_object(a.j,"$.userId")) from (select substring(json,2,length(json)-2) as j from weibo) a where lower(get_json_object(a.j,"$.source")) like "%iphone%";
```

### 统计微博中评论次数 `<1000` 的用户ID与数据来源信息，将其放入视图中

```
create view weibo_view as select get_json_object(a.j, "$.userId") as uid, get_json_object(a.j, "$.source") as source from (select substring(json, 2, length(json) - 2) as j from weibo) a;
```

### 统计视图中数据来源是“iPad客户端”的用户数目

```
select count(distinct(uid)) from weibo_view where source="iPad客户端";
```

# 统计分析2

### 编写 hive UDF 函数

详见链接：
https://www.malaoshi.top/show_1GW2MhjD8got.html

### 实现将微博的点赞人数与转发人数相加求和，并将相加之和降序排列，取前10条记录

向hive中添加自定义jar：
```
add jar /azkaban/weibo_sh/original-hive_weibo-1.0-SNAPSHOT.jar;
```

创建hive临时函数：

```
create temporary function wb as 'top.malaoshi.weibo.udf.Sum';
```

查询微博的点赞人数与转发人数相加求和，并将相加之和降序排列，取前10条记录：
```
select wb(cast(get_json_object(a.j,'$.praiseCount') as int),cast(get_json_object(a.j,'$.reportCount')as int)) as cnt from (select substring(json,2,length(json)-2) as j from weibo) a order by cnt desc limit 10;
```

### 统计微博内容中出现 **“iphone”** 次数最多的用户，最终结果输出用户ID和次数前10行

向hive中添加自定义jar：

```
add jar /azkaban/weibo_sh/original-hive_weibo-1.0-SNAPSHOT.jar;
```

创建hive临时函数：

```
create temporary function wcount as 'top.malaoshi.weibo.udf.Count';
```

统计微博内容中出现“北京”次数最多的用户，最终结果输出用户ID和次数，取前10行：

```
select b.id,sum(b.cnt) as cn from (select get_json_object(a.j,'$.userId')as id, wcount(get_json_object(a.j,'$.content'),'北京') as cnt from (select substr(json,2,length(json)-2) as j from weibo) a)b group by b.id order by cn desc limit 10;
```

原文出处：http://www.malaoshi.top/show_1GW2Mr1A0tF8.html