按现有字段对 Hive 表进行分区?
我可以在插入现有字段时对 Hive 表进行分区吗?
我有一个 10 GB 的文件,其中包含日期字段和一天中的小时字段。我可以将此文件加载到表中,然后插入覆盖到另一个使用这些字段作为分区的分区表中吗?会像下面这样工作吗?
INSERT OVERWRITE TABLE tealeaf_event PARTITION(dt=evt.datestring,hour=evt.hour)
SELECT * FROM staging_event evt;
谢谢!
特拉维斯
Can I partition a Hive table upon insert by an existing field?
I have a 10 GB file with a date field and an hour of day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as a partition? Would something like the following work?
INSERT OVERWRITE TABLE tealeaf_event PARTITION(dt=evt.datestring,hour=evt.hour)
SELECT * FROM staging_event evt;
Thanks!
Travis
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我刚刚遇到这个问题,试图回答同样的问题,它很有帮助,但并不完整。简短的答案是肯定的,类似问题中的查询之类的东西可以工作,但语法不太正确。
假设您有使用以下语句创建的三个表:
列
a
和b
只是一些示例列。dt
和hour
是我们在到达生产表后要对其进行分区的值。将暂存数据从staging_unpartitioned
和staging_partitioned
移至生产环境看起来完全相同。这使用了一个称为动态分区的过程,您可以阅读这里。需要注意的重要一点是,哪些列与哪些分区关联是由 SELECT 顺序决定的。所有动态分区必须最后按顺序选择。
当您尝试运行上面的代码时,由于您设置的属性,您很可能会遇到错误。首先,如果您禁用了动态分区,它将无法工作,因此请确保:
如果您没有在动态分区之前至少在一个静态分区上进行分区,则可能会遇到错误。当您打算用动态分区覆盖根分区的子分区时,此限制可以避免您意外删除根分区。根据我的经验,这种行为从来没有帮助,而且常常很烦人,但你的情况可能会有所不同。无论如何,改变很容易:
而且应该可以做到。
I just ran across this trying to answer the same question and it was helpful but not quite complete. The short answer is yes, something like the query in the question will work but the syntax is not quite right.
Say you have three tables which were created using the following statements:
Columns
a
andb
are just some example columns.dt
andhour
are the values we want to partition on once it gets to the production table. Moving the staging data to production fromstaging_unpartitioned
andstaging_partitioned
looks exactly the same.This uses a process called Dynamic Partitioning which you can read about here. The important thing to note is that which columns are associated with which partitions is determined by the SELECT order. All dynamic partitions must be selected last and in order.
There's a good chance when you try to run the code above you will hit an error due to the properties you have set. First, it will not work if you have dynamic partitioning disabled so make sure to:
Then you might hit an error if you aren't partitioning on at least one static partition before the dynamic partitions. This restriction would save you accidentally removing a root partition when you meant to overwrite its sub-partitions with dynamic partitions. In my experience this behavior has never been helpful and has often been annoying, but your mileage may vary. At any rate, it is easy to change:
And that should do it.
也许这已经得到了回答...但是,是的,您可以完全按照您所说的进行操作。我已经做过很多次了。显然,您的新表的定义需要与原始表类似,但没有分区列,并且具有分区规范。另外,我不记得是否必须明确列出原始表中的列,或者星号是否足够。
Maybe this is already answered... but yes, you can do exactly as you have stated. I have done it many times. Obviously your new table would need to be defined similar to the original one, but without the partition column, and with the partition specification. Also, I cannot remember if I had to explicitly list out the columns in the original table, or if the asterik was sufficient.
我对此不太确定,但这样的东西可能会起作用
I'm not super sure about this, but something like this might work
不可以。您必须删除该字段,或者至少重命名它。
No. You will have to drop that field or, at least, rename it.