Hive 中的更新、设置选项

发布于 2024-11-11 15:08:44 字数 365 浏览 2 评论 0原文

我知道 Hadoop 中没有文件更新，但在 Hive 中，可以使用语法糖将新值与表中的旧数据合并，然后使用合并的输出重写表，但如果我在另一个表中有新值那么我可以通过使用左外连接来达到相同的效果。

我的情况的问题是，我必须通过为具有某些 where 条件的列设置一个值来更新表。据了解，不支持SET。

例如，考虑下面的正常查询：

UPDATE temp1
SET location=florida
WHERE id=206;

我尝试在 Hive 中转换相同的查询，但我陷入了 SET 的位置。如果有人让我知道，那对我来说将是一个很大的帮助。

原文

I know there is no update of file in Hadoop but in Hive it is possible with syntactic sugar to merge the new values with the old data in the table and then to rewrite the table with the merged output but if I have the new values in another table then I can achieve the same effect by using a left outer join.

The problem in my case is that I have to update the table by setting one value to a column with some where condition. It is known that SET is not supported.

For example, consider the below normal query:

UPDATE temp1
SET location=florida
WHERE id=206;

I tried to convert the same in Hive but I got stuck in the place of SET. If anyone let me know then it would be a great help for me.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

零度° 2024-11-18 15:08:44

INSERT OVERWRITE TABLE _tableName_ PARTITION (_partitionColumn_= _partitionValue_) 
SELECT [other Things], CASE WHEN id=206 THEN 'florida' ELSE location END AS location, [other Other Things] 
FROM _tableName_ WHERE [_whereClause_];

您可以列出多个分区，并用逗号分隔它们。 ...分区（_partitionColumn_= _partitionValue1_、_partitionColumn_= _partitionValue2_、...）。我还没有对多个分区执行此操作，一次只执行一个，因此我会在一次执行所有分区之前检查测试/开发环境上的结果。我还有其他原因将每个OVERWRITE限制为单个分区。

此页面https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML 还有更多内容。
一般情况下，此网站 https://cwiki.apache.org/confluence/display/Hive/LanguageManual ，是您使用 HiveSQL 时最好的朋友。

我开发了与此相同的东西来迁移一些数据并且它有效。我还没有在大型数据集上尝试过它，只有几 GB 并且它运行得很好。

注意 - 这将覆盖分区。它将使以前的文件成为再见。创建备份和恢复脚本/过程。
[other Things] 和 [other Other Things] 是表中的其余列。它们需要按正确的顺序排列。这非常重要，否则您的数据将被损坏。

希望这有帮助。 :)

INSERT OVERWRITE TABLE _tableName_ PARTITION (_partitionColumn_= _partitionValue_) 
SELECT [other Things], CASE WHEN id=206 THEN 'florida' ELSE location END AS location, [other Other Things] 
FROM _tableName_ WHERE [_whereClause_];

You can have multiple partitions listed by separating them by commas. ... PARTITION (_partitionColumn_= _partitionValue1_, _partitionColumn_= _partitionValue2_, ...). I haven't done this with multiple partitions, just one at a time, so I'd check the results on a test/dev env before doing all partitions at once. I had other reasons for limiting each OVERWRITE to a single partition as well.

This page https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML has a little more on it.
This site https://cwiki.apache.org/confluence/display/Hive/LanguageManual, in general, is your best friend when working with HiveSQL.

I've developed something identical to this to migrate some data and it's worked. I haven't tried it against large datasets, only a few GB and it has worked perfectly.

To Note - This will OVERWRITE the partition. It will make previous files go bye-bye. Create backup and restore scripts/procedures.
The [other Things] and [other Other Things] are the rest of the columns from the table. They need to be in the correct order. This is very important or else your data will be corrupted.

Hope this helps. :)

回复收藏 0 原文

卷耳 2024-11-18 15:08:44

这可能有点hacky，但它对我在工作中必须做的事情很有用。

    INSERT OVERWRITE TABLE tabletop PARTITION(partname = 'valueIwantToSet')
    SELECT things FROM databases.tables WHERE whereclause;

正如您可能期望的那样，这会将您的数据分成多个分区，但是如果您要设置的值的分布与“良好的数据块大小”成正比（这取决于您的设计），那么对该数据的查询将会更好优化的

@Jothi：您能发布您使用的查询吗？

This may be hacky but it's worked for somethings I've had to do at work.

    INSERT OVERWRITE TABLE tabletop PARTITION(partname = 'valueIwantToSet')
    SELECT things FROM databases.tables WHERE whereclause;

As you might expect this breaks your data up into partitions, but if the distribution of the value you want to set is proportional to "good data chunk sizes" (this is up to you to design) then your queries on that data will be better optimized

@Jothi: Could you please post the query you used?

回复收藏 0 原文

~没有更多了~