Hive: adding dynamic partitions to an external table
I am running Hive 0.7.1, processing existing data that has the following directory layout:
- TableName
  - d=2011-08-01
  - d=2011-08-02
  - d=2011-08-03
  - ... and so on
Under each date directory I have the data files for that date. To load the data I'm using:

CREATE EXTERNAL TABLE table_name (i int)
PARTITIONED BY (d STRING)
LOCATION '${hiveconf:basepath}/TableName';

(The partition column is named d to match the d=... directory layout and the ALTER TABLE statements below.)
I would like my Hive script to be able to load the relevant partitions according to an input date and a number of days. So if I pass date='2011-08-03' and days='7', the script should load the following partitions:
- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09
I haven't found any decent way to do this except explicitly running:
ALTER TABLE table_name ADD PARTITION (d='2011-08-03');
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');
and then running my query:

select count(1) from table_name;

However, this is of course not automated according to the date and days input.
Is there any way I can define the external table so that it loads partitions according to a date range, or date arithmetic?
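For now, the explicit ALTER TABLE workaround can at least be generated instead of written by hand. A minimal shell sketch, assuming GNU date is available for the date arithmetic (table and column names as above):

```shell
#!/bin/sh
# Print ALTER TABLE ... ADD PARTITION statements for a date range.
# Usage: gen_partitions <start-date> <days>
# Assumes GNU date (the -d option with relative date strings).
gen_partitions() {
    start="$1"
    days="$2"
    i=0
    while [ "$i" -lt "$days" ]; do
        d=$(date -d "$start + $i days" +%Y-%m-%d)
        echo "ALTER TABLE table_name ADD PARTITION (d='$d');"
        i=$((i + 1))
    done
}

# The generated statements can then be piped into the Hive CLI, e.g.:
#   gen_partitions 2011-08-03 7 | hive
gen_partitions 2011-08-03 7
```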
4 answers
I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:

MSCK REPAIR TABLE table_name;

Explained here. This also mentions the "alter table X recover partitions" that the OP commented on his own post. MSCK REPAIR TABLE works on non-Amazon-EMR implementations (Cloudera in my case).
The partitions are a physical segmenting of the data: each partition is maintained by the directory system, and queries use the metadata to determine where the partition is located. So if you can make the directory structure match the query, it should find the data you want. I don't know of any date-range operations, though; otherwise, you'll have to do the date math first to build the query pattern.

You can also create external tables and add partitions to them that define the location. This allows you to shred the data as you like and still use the partition scheme to optimize the queries.
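A sketch of that second suggestion: ALTER TABLE ... ADD PARTITION accepts an explicit LOCATION, so a partition can point at any directory (the paths below are hypothetical examples, not from the question):

```sql
-- External table whose partitions can live anywhere on the filesystem.
CREATE EXTERNAL TABLE table_name (i INT)
PARTITIONED BY (d STRING);

-- Register a partition and point it at an arbitrary directory.
ALTER TABLE table_name ADD PARTITION (d='2011-08-03')
LOCATION '/data/2011/08/03';
```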
I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin; see Creating custom UDFs.

Probably no need to mention this, but have you considered a simple bash script that would take your parameters and pipe the commands to Hive?

Oozie workflows would be another option, however that might be overkill (Oozie Hive Extension). After some thinking, I don't think Oozie would work for this.
I have explained a similar scenario in my blog post:

1) You need to set the dynamic-partitioning properties.
2) Create an external staging table and load the input file data into it.
3) Create the main production external table "production_order" with the date field as one of the partition columns.
4) Load the production table from the staging table so that the data is distributed into partitions automatically.

The same concept is explained in the blog post below, if you want to see the code:
http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html
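A minimal sketch of those four steps; the table names, columns, and locations here are illustrative assumptions, not taken from the blog post:

```sql
-- 1) Enable dynamic partitioning.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- 2) Staging table holding the raw input, with the date as a plain column.
CREATE EXTERNAL TABLE staging_order (i INT, d STRING)
LOCATION '/staging/orders';

-- 3) Production table partitioned by date.
CREATE EXTERNAL TABLE production_order (i INT)
PARTITIONED BY (d STRING)
LOCATION '/warehouse/orders';

-- 4) Dynamic-partition insert: Hive routes each row into the partition
--    named by the trailing SELECT column(s).
INSERT OVERWRITE TABLE production_order PARTITION (d)
SELECT i, d FROM staging_order;
```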