Hive: How long does compaction take to run?

Posted on 2025-01-11 13:06:53


Hive version: 3.1.0.3.1.4.0-315
Spark version: 2.3.2.3.1.4.0-315

Basically, I am trying to read transactional table data from Spark. According to this page [https://stackoverflow.com/questions/50254590/how-to-read-orc-transaction-hive-table-in-spark][1], I found that a transactional table has to be compacted first, so I want to try this approach.

I am new to this. I tried running compaction on the delta files, but the state always shows "initiated" and the compaction never completes.
This happens for both major and minor compaction. Any help would be highly appreciated.

  1. Is this a good approach?
  2. How can I monitor the progress of a compaction job other than with SHOW COMPACTIONS? The only related line I can see in hiveserver_stdout.log is "Compaction enqueued with id 1".
  3. Generally, how long does a compaction take to complete?
  4. Is there any way to stop a compaction?

TIA.

[Edited]

SHOW COMPACTIONS;

+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| compactionid  |  dbname   |    tabname     |    partname    |  type  |   state    | workerid  |  starttime  |   duration    | hadoopjobid  |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| CompactionId  | Database  | Table          | Partition      | Type   | State      | Worker    | Start Time  | Duration(ms)  | HadoopJobId  |
| 1             | tmp       | shop_na2       | dt=2014-00-00  | MAJOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
| 2             | tmp       | na2_check      | dt=2014-00-00  | MINOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
3 rows selected (0.408 seconds)

SHOW COMPACTIONS has returned this same result for the past 36 hours, even though the retention period is set to 86400 seconds.


Comments (1)

转身泪倾城 2025-01-18 13:06:53


It is advisable to run this operation when the load on the cluster is low, for example over a weekend when fewer jobs are running. Compaction is resource-intensive, and the time it takes depends on the data, but a moderate quantity of deltas can take multiple hours. You can use the query SHOW COMPACTIONS; to get an update on the status of a compaction, including the following details:

  - Database name
  - Table name
  - Partition name
  - Major or minor compaction
  - Compaction state:
      - Initiated - waiting in queue
      - Working - currently compacting
      - Ready for cleaning - compaction completed and old files scheduled for removal
  - Thread ID
  - Start time of compaction
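
As a rough illustration of checking these states programmatically (not part of the original answer), the sketch below parses SHOW COMPACTIONS output that has been captured as Beeline-style text, such as the table in the question, and lists the entries still waiting in the queue. The function names `parse_compactions` and `stuck_in_queue` are hypothetical, and the code assumes the output keeps the same pipe-delimited layout shown in the question.

```python
# Sample text copied from the SHOW COMPACTIONS output in the question.
SAMPLE = """\
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| compactionid  |  dbname   |    tabname     |    partname    |  type  |   state    | workerid  |  starttime  |   duration    | hadoopjobid  |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| CompactionId  | Database  | Table          | Partition      | Type   | State      | Worker    | Start Time  | Duration(ms)  | HadoopJobId  |
| 1             | tmp       | shop_na2       | dt=2014-00-00  | MAJOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
| 2             | tmp       | na2_check      | dt=2014-00-00  | MINOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
3 rows selected (0.408 seconds)
"""


def parse_compactions(text):
    """Turn Beeline-style table text into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and the "N rows selected" footer
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line holds the column names
            continue
        row = dict(zip(header, cells))
        if row["compactionid"] == "CompactionId":
            continue  # in this output Hive returns its own title row as data
        rows.append(row)
    return rows


def stuck_in_queue(rows):
    """Return the compactions whose state is still 'initiated'."""
    return [row for row in rows if row["state"].lower() == "initiated"]


for row in stuck_in_queue(parse_compactions(SAMPLE)):
    print(row["compactionid"], row["dbname"], row["tabname"], row["type"], row["state"])
```

Run against the question's output, both requests (ids 1 and 2) are reported as still queued, which matches the "initiated" state described above: no worker has picked them up yet.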
