Partition a Hive table by an existing field?

Posted on 2024-11-19 05:06:06

Can I partition a Hive table upon insert by an existing field?

I have a 10 GB file with a date field and an hour of day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as a partition? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event  PARTITION(dt=evt.datestring,hour=evt.hour) 
SELECT * FROM staging_event evt;

Thanks!

Travis

作死小能手 2024-11-26 05:06:06

I just ran across this while trying to answer the same question, and it was helpful but not quite complete. The short answer is yes: something like the query in the question will work, but the syntax is not quite right.

Say you have three tables which were created using the following statements:

CREATE TABLE staging_unpartitioned (datestring string, hour int, a int, b int);

CREATE TABLE staging_partitioned (a int, b int) 
    PARTITIONED BY (datestring string, hour int);

CREATE TABLE production_partitioned (a int, b int) 
    PARTITIONED BY (dt string, hour int);

Columns a and b are just some example columns. dt and hour are the values we want to partition on once the data gets to the production table. Moving the staging data to production looks exactly the same from staging_unpartitioned and from staging_partitioned.

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_partitioned;

This uses a feature called dynamic partitioning, which is covered in the Hive documentation. The important thing to note is that the SELECT order determines which columns are associated with which partitions. All dynamic partition columns must be selected last, in the same order as they appear in the PARTITION clause.
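
To make the positional mapping concrete, here is a small sketch against the example tables defined above. It assumes dynamic partitioning is enabled and the partition mode is nonstrict. The alias is only there for the reader: as far as I know, Hive matches dynamic partition columns by position rather than by name, so the alias does not decide which partition a value lands in.

```sql
-- The last two SELECT expressions feed PARTITION (dt, hour), in that order.
-- The alias is for readability only; the mapping is positional.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring AS dt, hour
    FROM staging_unpartitioned;

-- List the partitions the insert created, to check the result.
SHOW PARTITIONS production_partitioned;
```

If the partitions listed by SHOW PARTITIONS look swapped or contain unexpected values, the SELECT column order is the first thing to check.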

There's a good chance that when you try to run the code above you will hit an error because of the properties you have set. First, it will not work if dynamic partitioning is disabled, so make sure to set:

set hive.exec.dynamic.partition=true;

Then you might hit an error if you aren't specifying at least one static partition before the dynamic partitions. This restriction is meant to save you from accidentally removing a root partition when you only meant to overwrite its sub-partitions with dynamic partitions. In my experience this behavior has never been helpful and has often been annoying, but your mileage may vary. At any rate, it is easy to change:

set hive.exec.dynamic.partition.mode=nonstrict;

And that should do it.
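
If you would rather keep strict mode, the alternative is to give the leading partition a static value and let only the trailing one be dynamic. A minimal sketch, again using the example tables above; the date literal '2024-01-01' is a made-up value for illustration:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=strict;

-- dt is static (given as a literal), so it is not part of the SELECT list.
-- hour is dynamic and is taken from the last SELECT column.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt='2024-01-01', hour)
    SELECT a, b, hour
    FROM staging_unpartitioned
    WHERE datestring = '2024-01-01';
```

The cost of this approach is one statement per date, which is why nonstrict mode is usually more convenient for a one-off load of a file that spans many dates.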


想你的星星会说话 2024-11-26 05:06:06

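Maybe this has already been answered... but yes, you can do exactly what you describe. I have done it many times. Obviously, the definition of your new table needs to be similar to the original one, but without the partition columns in the column list and with a partition specification instead. Also, I don't remember whether you have to list the columns from the original table explicitly or whether an asterisk is enough.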
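
On the asterisk question: because dynamic partition columns are matched by position, SELECT * should only work when the source table's columns already end with the partition columns in the right order. As far as I know, Hive returns a partitioned table's partition columns last in SELECT *. Under that assumption, and using the example tables from the answer above, a sketch of the two cases:

```sql
-- staging_partitioned: SELECT * yields a, b, datestring, hour,
-- so the last two columns line up with PARTITION (dt, hour).
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT * FROM staging_partitioned;

-- staging_unpartitioned: SELECT * yields datestring, hour, a, b,
-- which would use a and b as the partition values. List the columns instead.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;
```

Listing the columns explicitly is the safer habit, since it keeps working if the staging table's column order changes.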


澉约 2024-11-26 05:06:06

I'm not super sure about this, but something like this might work

INSERT OVERWRITE TABLE tealeaf_event
SELECT col1 as tealeaf_col1, ..., datestring as ds;


朱染 2024-11-26 05:06:06

No. You have to drop the field, or at least rename it.
