Why is the right number of reduces in Hadoop 0.95 or 1.75?

Posted on 2024-12-01 22:06:23

The Hadoop documentation states:

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).

With 0.95 all of the reduces can launch immediately and start
transferring map outputs as the maps finish. With 1.75 the faster
nodes will finish their first round of reduces and launch a second
wave of reduces doing a much better job of load balancing.

Are these values pretty constant? What are the results when you choose a value between these numbers, or outside of them?
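
For context, here is a minimal sketch of how I understand that formula gets plugged into a job. The node count and slots-per-node value are made-up numbers, and the job setup is just standard new-API boilerplate:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "reducer count sketch");

            // Hypothetical cluster: 10 worker nodes, each configured with
            // mapred.tasktracker.reduce.tasks.maximum = 2 reduce slots.
            int nodes = 10;
            int reduceSlotsPerNode = 2;

            // 0.95 * total slots: all reduces launch in a single wave.
            int singleWave = (int) (0.95 * nodes * reduceSlotsPerNode); // 19

            // 1.75 * total slots: faster nodes run a second wave of reduces.
            int twoWaves = (int) (1.75 * nodes * reduceSlotsPerNode);   // 35

            job.setNumReduceTasks(singleWave);
        }
    }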

Comments (3)

终陌 2024-12-08 22:06:23

The values should be what your situation needs them to be. :)

The below is my understanding of the benefit of the values:

The .95 is to allow maximum utilization of the available reducers. If Hadoop defaults to a single reducer, there will be no distribution of the reducing work, causing it to take longer than it should. In my limited cases, there is a near-linear fit between the increase in reducers and the reduction in time: if a job takes 16 minutes on 1 reducer, it takes about 2 minutes on 8 reducers.

The 1.75 is a value that attempts to account for performance differences among the machines in your cluster. It creates more than a single pass of reducers, so that the faster machines take on additional reducers while the slower machines do not.
This figure (1.75) is one that will need to be adjusted much more to your hardware than the .95 value. If you have 1 quick machine and 3 slower ones, maybe you'll only want 1.10. This number will need more experimentation to find the value that fits your hardware configuration. If the number of reducers is too high, the slow machines will become the bottleneck again.
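
To make the two-wave idea concrete with made-up numbers: on a cluster of 4 nodes with 4 reduce slots each (16 slots total), a factor of 1.75 gives 28 reducers. The first 16 start immediately; the remaining 12 are handed out as slots free up, so the faster machines naturally absorb more of the second wave. With 1 fast machine and 3 slow ones, a factor around 1.10 would give only about 17 reducers, leaving just one extra task for the fast machine to pick up.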

陌路黄昏 2024-12-08 22:06:23

To add to what Nija said above, and also a bit of personal experience:

0.95 makes a bit of sense because you are utilizing most of the cluster's capacity, but at the same time you are leaving a few empty task slots to fall back on in case some of your reducers fail. If you use 1x the number of reduce task slots, a failed reduce has to wait until at least one reducer finishes. If you use 0.85 or 0.75 of the reduce task slots, you're not utilizing as much of your cluster as you could.
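
As a concrete (made-up) example: with 10 nodes and 2 reduce slots each, 0.95 gives 19 reducers, so one slot stays free. If a reduce attempt fails or is speculatively re-executed, it can be rescheduled into that free slot right away, instead of waiting for one of 20 running reducers to finish.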

指尖上得阳光 2024-12-08 22:06:23

We can say that these numbers are no longer valid. According to the book "Hadoop: The Definitive Guide" and the Hadoop wiki, the current target is for each reducer to run for about five minutes.

Fragment from the book:

Choosing the Number of Reducers

The single reducer default is something of a gotcha for new users to Hadoop. Almost all real-world jobs should set this to a larger number; otherwise, the job will be very slow since all the intermediate data flows through a single reduce task. Choosing the number of reducers for a job is more of an art than a science. Increasing the number of reducers makes the reduce phase shorter, since you get more parallelism. However, if you take this too far, you can have lots of small files, which is suboptimal. One rule of thumb is to aim for reducers that each run for five minutes or so, and which produce at least one HDFS block's worth of output.
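
A rough sketch of how that rule of thumb could be turned into a number; the output size and single-reducer runtime below are invented purely for illustration:

    public class ReducerSizingSketch {
        public static void main(String[] args) {
            // Invented job: ~50 GB of reduce output, 128 MB HDFS blocks.
            long outputBytes = 50L * 1024 * 1024 * 1024;
            long blockBytes = 128L * 1024 * 1024;

            // "At least one HDFS block's worth of output" per reducer caps
            // how many reducers are worth running for this output size.
            long maxUsefulReducers = outputBytes / blockBytes;         // 400

            // "About five minutes per reducer": if the reduce work would take
            // roughly 600 minutes on a single reducer, aim for ~120 of them.
            double singleReducerMinutes = 600.0;
            long fromRuntime = Math.round(singleReducerMinutes / 5.0); // 120

            long reducers = Math.min(maxUsefulReducers, fromRuntime);
            System.out.println("Suggested number of reducers: " + reducers);
        }
    }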
