HIVE-dml（Data Manipulation Language）

发布于 2024-07-03 01:55:24 字数 10728 浏览 43 评论 0

显示系统内容

show functions; : 显示全部已有函数
[desc|describe] function substr; : 某函数的描述
[desc|describe] formatted <TABLE> : 会告知是 managed_table 还是 external_table

load 命令导入数据

load 命令不会校验数据和 schema 的差异，LOCAL 和 OVERWRITE 是可选参数

-- omitted LOCAL will find in HDFS
-- omitted OVERWRITE data files are appended to existing data sets
LOAD DATA [LOCAL] INPATH './examples/files/kv2.txt' [OVERWRITE] INTO TABLE table_name PARTITION (ds='2008-08-15');

insert 命令导入文件或表

-- INSERT 这个关键字可以忽略
-- omitted LOCAL will insert into HDFS
INSERT OVERWRITE [LOCAL] DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

MULTITABLE INSERT

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

设置相关

hive 设置动态分区,并插入动态分区表

set hive.exec.dynamic.partition=true;
insert overwrite table table_name partition(part_desc) select column from srv_table_name

Join

LanguageManual Joins : 关于如何 join 及如果优化的手册

功能实现

正则

where 中使用正则： rlike 和 regexp 用的是 Java regular
判断字段 field 是否是全都是数字的形式: where field rlike '^\\d+$'
正则提取连续的数字, eg : regexp_extract(field, '[0-9]*', 0)
正则提取连续的数字, eg : regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)

条件相关

if coalesce 和 case when 三者的区别

Hive supports three types of conditional functions . These functions are listed below:

IF( Test Condition, True Value, False Value ) : The IF condition evaluates the "Test Condition" and if the "Test Condition" is true, then it returns the "True Value". Otherwise, it returns the False Value. Example: IF(1=1, 'working', 'not working') returns 'working'
COALESCE( value1,value2,... ) : The COALESCE function returns the fist not NULL value from the list of values. If all the values in the list are NULL, then it returns NULL. Example: COALESCE(NULL,NULL,5,NULL,4) returns 5. COALESCE(column,CAST(0 AS BIGINT)) and COALESCE(column, 0L)
CASE Statement : The syntax for the case statement is:
```
CASE  [ expression ]
  WHEN condition1 THEN result1
  WHEN condition2 THEN result2
  ...
  WHEN conditionn THEN resultn
  ELSE result
END
```
Here expression is optional. It is the value that you are comparing to the list of conditions. (ie: condition1, condition2, ... conditionn).
All the conditions must be of same datatype . Conditions are evaluated in the order listed. Once a condition is found to be true, the case statement will return the result and not evaluate the conditions any further.
All the results must be of same datatype. This is the value returned once a condition is found to be true.
IF no condition is found to be true, then the case statement will return the value in the ELSE clause. If the ELSE clause is omitted and no condition is found to be true, then the case statement will return NULL
Example:
```
CASE Fruit
  WHEN 'APPLE' THEN 'The owner is APPLE'
  WHEN 'ORANGE' THEN 'The owner is ORANGE'
  ELSE 'It is another Fruit'
END
The other form of CASE is

CASE
  WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
  WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
  ELSE 'It is another Fruit'
END
```

时间相关

hive 中将 string 转成 date 类型
- 最简单 cast('2017-01-01' as date) 只能是 yyyy-MM-dd 类型
- 任意类型通过 unix 时间转 from_unixtime(bigint unix_timestamp(string '20170101', 'yyyyMMdd'), 'yyyy-MM-dd')
计算时间差
- datediff('2017-01-01', '2017-01-02') 前面时间-后面时间，参数接受默认的格式是'yyyy-MM-dd'
- 结合 from_unixtimestamp 和 unix_timestamp 两个函数，可以实现各种类型时间时间差的计算

FAQ

hive 中，大表与大表关联，关联的 key 发生数据倾斜，对关联的 key 作分桶，能提高 3 倍效率左右
hiveserver2 中,如果直接使用 spark 的 start-thriftserver.sh 启动且 spark 版本在 1.5.1 或者左右的, insert into table table values 失败的话,可以使用 insert into table <tablename> select * from (select 'col1', 'col2') t; 完成数据插入
直接将 hive 分区表的数据直接插入 hdfs 之后在 hive 客户端找不到对应的分区和数据,这是因为在 hive 的元数据库中没有相应的分区信息,可以在 hive 客户端中运行 msck repair table <tablename> 方式让 hive 自动发现分区,或者直接用 HQL 更新元数据 alter table <tablename> add partition (partition_column) location '/hadoop-file-system-location';

删除分区

删除指定分区: alter table <TABLE_NAME> drop partition(ds='20180101')
删除所有分区: alter table <TABLE_NAME> drop partition(ds<>'')
删除一个范围的分区: alter table <TABLE_NAME> drop partition(ds>'20180414',ds<'20180416')

插入动态分区 GC 异常

insert overwrite table partition(part_desc)
select *
from src_table

可能会导致 GC Overhead or Heap space error ,原因是插入同一分区的记录要求 select 语句中的字段是一样的,如果不一致会导致在插入的过程中 hive 要做很多判断,要避免这样的情况发生要保证同一分区的记录在同一个 reduce 中，可以将 sql 改成如下形式， shuffle 数量最好大于或者等于动态分区数,加上 DISTRIBUTE BY 之后可以按照分区字段进行分片,最后一个 select 的字段要是分区字段

set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.shuffle.partitions=2000;
insert overwrite table partition(part_desc)
select *
from src_table
DISTRIBUTE BY part_desc

使用 group by 替代 distinct

distinct 会将所有数据放到同一个 reduce 中进行去重，导致数据倾斜,而 group by 则会产生多个 reduce 并将相同聚合值的字段放到同一个 reduce 中，避免了数据倾斜，同时可以指定 set mapred.reduce.tasks=100 每个 reduce 的数量 Hive 能计算出数据量很大的数据集，但是不能计算数据倾斜的数据集

order by, distribute by, sort by, cluster by

order by : order by 会对输入做全局排序，因此只有 一个 Reducer (多个 Reducer 无法保证全局有序)，然而只有一个 Reducer，会导致当输入规模较大时，消耗较长的计算时间. 部分的 order by 可以使用 distribute by 和 sort by 完成
sort by : sort by 不是全局排序，其在数据 进入 reducer 前 完成排序，因此，如果用 sort by 进行排序，并且设置 mapred.reduce.tasks>1 ，则 sort by 只会 保证每个 reducer 的输出有序 ，并不保证全局有序。sort by 不同于 order by，它不受 hive.mapred.mode 属性的影响，sort by 的数据只能保证在同一个 reduce 中的数据可以按指定字段排序。使用 sort by 你可以指定执行的 reduce 个数（通过 set mapred.reduce.tasks=n 来指定)，对输出的数据再执行归并排序，即可得到全部结果
distribute by : distribute by 是 控制 map 端如何拆分数据给 reduce 端 的。hive 会根据 distribute by 后面列，对应 reduce 的个数进行分发，默认是采用 hash 算法。sort by 为每个 reduce 产生一个排序文件。在有些情况下， 需要控制某个特定行应该到哪个 reducer ，这通常是为了进行后续的聚集操作。distribute by 刚好可以做这件事。因此，distribute by 经常和 sort by 配合使用
cluster by : cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。但是排序 只能倒叙排序 ，不能指定排序规则为 ASC 或者 DESC

distribute by 和 sort by 的使用场景

Map 输出的文件大小不均
Reduce 输出文件大小不均
小文件过多
文件超大

获得时间之间天数的差别

SELECT CAST(MONTHS_BETWEEN(date1, date2) AS INT) as numberOfMonthsBetweenDates
FROM table

行列互转

列转行

-- cust_1, month_1, f1, f2, f3
-- 转成
-- cust_1, month_1, f1
-- cust_1, month_1, f2
-- cust_1, month_1, f3

select cust, month, value
from table lateral view explode(array(col3,col4,col5)) orig_table_alias as value;

行转列

行转列链接

-- 通过唯一键将稀疏表的值合并到一起
select resource_id,
     , case when collect_set(quarter_1)[0] is null then 0 else collect_set(quarter_1)[0] end as quarter_1_spends
     , case when collect_set(quarter_2)[0] is null then 0 else collect_set(quarter_2)[0] end as quarter_2_spends
     , case when collect_set(quarter_3)[0] is null then 0 else collect_set(quarter_3)[0] end as quarter_3_spends
     , case when collect_set(quarter_4)[0] is null then 0 else collect_set(quarter_4)[0] end as quarter_4_spends
from (
  -- 生成一个很多 null 的稀疏表
  select resource_id,
       , case when quarter = 'q1' then amount end as quarter_1
       , case when quarter = 'q2' then amount end as quarter_2
       , case when quarter = 'q3' then amount end as quarter_3
       , case when quarter = 'q4' then amount end as quarter_4
  from billing_info
) tbl1
group by resource_id;

如果生成数据序列

How to generate Date Series in HIVE? (Creating table)

select t.AccountNumber
     , date_add (t.StartDate,pe.i) as Day
from Table1 t
     lateral view
     posexplode(split(space(datediff(t.EndDate, t.StartDate)), ' ')) pe as i,x

选择除了个别字段的全部字段

hive.support.quoted.identifiers=none;
SELECT `(exclude_field)?+.+` FROM <TABLE_NAME>;

beeline 中处理 concat split 等的分号

因为 beeline 中语句的结尾就是分号，如果 concat split 使用了分号，就要将其转化为二进制格式 \073

select split(col, ';') from table;
select split(col, '\073') from table;

避免产生过多的小文件

明确 spark.sql.shuffle.partitions 的文件数

在 sql 语句的后面加上 distribute by rand() ，会将数据随机放到设定的分区中，除了可以避免产生过多的小文件外，还可以使数据更加均匀

insert overwrite table table_name
select *
from table_name_1
distribute by rand();

left semi join 和 inner join 及 left outer join 的区别

left semi join 是查看仅仅会保留主表的记录，不能取右表中的记录，也不会因为右表有多个关联的值而使得主表的记录重复，查看

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

列表为空，暂无数据

关于作者

人│生佛魔见

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客