Hive, Hadoop, and the mechanics behind hive.exec.reducers.max

Posted 2024-10-17 13:34:57

In the context of this other question here

Using the hive.exec.reducers.max directive has truly baffled me.

From my perspective, I thought Hive worked on some sort of logic like: I have N blocks in a desired query, so I need N maps. From N I will need some sensible range of reducers R, which can be anywhere from R = N / 2 down to R = 1. For the Hive report I was working on there were 1,200+ maps, and without any influence Hive made a plan for about 400 reducers, which was fine except that I was working on a cluster that only had 70 reducer slots in total. Even with the fair job scheduler, this caused a backlog that would hang up other jobs. So I tried a lot of different experiments until I found hive.exec.reducers.max and set it to something like 60.
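For reference, here is a sketch of the knobs involved on MapReduce-era Hive (the 1 GB per-reducer value was that era's default; the cap of 60 mirrors the experiment above, not a recommended value):

    -- Hive estimates the reducer count roughly as:
    --   reducers = min(hive.exec.reducers.max,
    --                  ceil(stage input bytes / hive.exec.reducers.bytes.per.reducer))
    SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- 1 GB, the old default
    SET mapred.reduce.tasks=-1;     -- -1 = let Hive estimate the count itself
    SET hive.exec.reducers.max=60;  -- cap the estimate below the cluster's 70 slots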

The result was that a Hive job that had taken 248 minutes finished in 155 minutes with no change in the output. What bothers me is: why not have Hive default to N never being greater than the cluster's reducer capacity? And seeing as I can roll over several terabytes of data with a smaller set of reducers than Hive thinks is correct, is it better to always try to tweak this count?

Comments (2)

人疚 2024-10-24 13:34:57

You may want to look at this page, which talks about optimizing the number of slots: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage

Here is my opinion on it:

1) Ideally, Hive would try to optimize the number of reducers based on the expected amount of data generated after the map tasks, and it would expect the underlying cluster to be configured to support that.

2) Regarding whether it is a good idea to tweak this count:

  • First, let's try to analyze why the execution time came down from 248 minutes to 155 minutes:

Case 1: Hive uses 400 reducers
Problem: only 70 reducers can run at any given time, so the 400 reduce tasks have to finish in roughly ceil(400 / 70) = 6 scheduling waves.

  • Assuming no JVM reuse, creating a fresh JVM for every one of those tasks adds a large overhead.

  • Not sure on this: expecting 400 reducers could cause something like fragmentation. That is, if I know only 70 reducers can run, my strategy for storing intermediate files would depend on that; with 400 reducers the whole strategy goes for a toss.

Case 2: Hive uses 70 reducers
- Both problems get addressed by setting this number.

I guess it's better to set the maximum to the number of reducers actually available. But I am no expert at this; I would let the experts comment.
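For anyone wanting to check before committing to a long run, the planner's inputs can be inspected from the Hive CLI (a sketch; the table and column names are made up, and the quoted log line is what classic Hive prints when the job actually launches):

    -- SET with no value prints the current setting:
    SET hive.exec.reducers.max;
    SET hive.exec.reducers.bytes.per.reducer;
    -- EXPLAIN shows the stage plan without running the job; the estimated
    -- reducer count itself appears in the console at launch time, e.g.
    --   "Number of reduce tasks not specified. Estimated from input data size: 400"
    EXPLAIN SELECT dept, COUNT(*) FROM report_table GROUP BY dept;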

你的往事 2024-10-24 13:34:57

In my experience with Hive, setting mapred.job.reuse.jvm.num.tasks to a healthy number (in my case, 8) helps with a lot of these ad-hoc queries. It takes around 20-30 seconds to spawn a JVM, so reuse can help out quite a bit with mappers and reducers that are short-lived (< 30 seconds).
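A minimal sketch of applying that per session (8 here is just the answerer's value; -1 would allow unlimited reuse):

    -- Reuse each task JVM for up to 8 tasks, amortizing the ~20-30 s
    -- startup cost across short-lived mappers and reducers:
    SET mapred.job.reuse.jvm.num.tasks=8;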
