Hive: adding dynamic partitions to an external table
I am running Hive 0.7.1, processing existing data that has the following directory layout:
- TableName
  - d=2011-08-01
  - d=2011-08-02
  - d=2011-08-03
  - ... and so on
Under each date directory I have the data files for that date. To load the data I'm using:

CREATE EXTERNAL TABLE table_name (i int)
PARTITIONED BY (d STRING)
LOCATION '${hiveconf:basepath}/TableName';

(The partition column is named d to match the d=... directory layout and the ALTER TABLE statements below.)
I would like my Hive script to be able to load the relevant partitions according to an input date and a number of days. So if I pass date='2011-08-03' and days='7', the script should load the following partitions:
- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09
I haven't found any decent way to do this except explicitly running:
ALTER TABLE table_name ADD PARTITION (d='2011-08-03');
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');
and then running my query:

select count(1) from table_name;

However, this is of course not automated according to the date and days input.
Is there any way I can define the external table so that it loads partitions according to a date range, or date arithmetic?
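For now, the explicit ALTER TABLE workaround can at least be generated instead of written by hand. A minimal shell sketch, assuming GNU date is available for the date arithmetic (table and column names as above):

```shell
#!/bin/sh
# Print ALTER TABLE ... ADD PARTITION statements for a date range.
# Usage: gen_partitions <start-date> <days>
# Assumes GNU date (the -d option with relative date strings).
gen_partitions() {
    start="$1"
    days="$2"
    i=0
    while [ "$i" -lt "$days" ]; do
        d=$(date -d "$start + $i days" +%Y-%m-%d)
        echo "ALTER TABLE table_name ADD PARTITION (d='$d');"
        i=$((i + 1))
    done
}

# The generated statements can then be piped into the Hive CLI, e.g.:
#   gen_partitions 2011-08-03 7 | hive
gen_partitions 2011-08-03 7
```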
4 answers
I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:

MSCK REPAIR TABLE table_name;

Explained here. This also mentions the "alter table X recover partitions" that the OP commented on his own post. MSCK REPAIR TABLE works on non-Amazon-EMR implementations (Cloudera in my case).
The partitions are a physical segmenting of the data: each partition is maintained by the directory system, and queries use the metadata to determine where the partition is located. So if you can make the directory structure match the query, it should find the data you want. I don't know of any date-range operations, though; otherwise, you'll have to do the date math first to build the query pattern.

You can also create external tables and add partitions to them that define the location. This allows you to shred the data as you like and still use the partition scheme to optimize the queries.
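A sketch of that second suggestion: ALTER TABLE ... ADD PARTITION accepts an explicit LOCATION, so a partition can point at any directory (the paths below are hypothetical examples, not from the question):

```sql
-- External table whose partitions can live anywhere on the filesystem.
CREATE EXTERNAL TABLE table_name (i INT)
PARTITIONED BY (d STRING);

-- Register a partition and point it at an arbitrary directory.
ALTER TABLE table_name ADD PARTITION (d='2011-08-03')
LOCATION '/data/2011/08/03';
```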
I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin; see Creating custom UDFs.

Probably no need to mention this, but have you considered a simple bash script that would take your parameters and pipe the commands to Hive?

Oozie workflows would be another option, however that might be overkill (Oozie Hive Extension). After some thinking, I don't think Oozie would work for this.
I have explained a similar scenario in my blog post:

1) You need to set the dynamic-partitioning properties.
2) Create an external staging table and load the input file data into it.
3) Create the main production external table "production_order" with the date field as one of the partition columns.
4) Load the production table from the staging table so that the data is distributed into partitions automatically.

The same concept is explained in the blog post below, if you want to see the code:
http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html
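A minimal sketch of those four steps; the table names, columns, and locations here are illustrative assumptions, not taken from the blog post:

```sql
-- 1) Enable dynamic partitioning.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- 2) Staging table holding the raw input, with the date as a plain column.
CREATE EXTERNAL TABLE staging_order (i INT, d STRING)
LOCATION '/staging/orders';

-- 3) Production table partitioned by date.
CREATE EXTERNAL TABLE production_order (i INT)
PARTITIONED BY (d STRING)
LOCATION '/warehouse/orders';

-- 4) Dynamic-partition insert: Hive routes each row into the partition
--    named by the trailing SELECT column(s).
INSERT OVERWRITE TABLE production_order PARTITION (d)
SELECT i, d FROM staging_order;
```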