Hadoop combiner sort phase



When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run during intermediate steps when merge sorting. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.

If this doesn't currently happen, is there a particular reason, or just something which hasn't been implemented?

Thanks in advance!


方圜几里 2024-12-18 08:53:18


Combiners are there to save network bandwidth.

The map output gets sorted directly:

sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);

This happens right after the actual mapping is done. While iterating through the buffer, the task checks whether a combiner has been set; if so, it combines the records. If not, it spills directly to disk.

The important parts are in the MapTask, if you'd like to see it for yourself.

    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    // ... some fields ...
    for (int i = 0; i < partitions; ++i) {
        // check whether a combiner is configured
        if (combinerRunner == null) {
            // no combiner: spill directly
        } else {
            combinerRunner.combine(kvIter, combineCollector);
        }
    }

This is the right stage to save disk space and network bandwidth, because the output very likely has to be transferred.
During the merge/shuffle/sort phase it is not beneficial, because by then you would have to crunch through more data than the combiner run at map-finish time.

Note that the sort phase shown in the web interface is misleading. It is just pure merging.
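To make the effect of combine-at-spill-time concrete, here is a toy model (not actual Hadoop code; a simplified sketch with an assumed WordCount-style count combiner): sorting the in-memory buffer groups equal keys together, and the combiner then collapses each run into one record before anything is written to disk.

```java
import java.util.*;

public class SpillCombine {
    // Toy model of a map-side spill buffer holding raw keys.
    // Sorting groups equal keys; the "combiner" collapses each run of
    // equal keys into a single (key, count) record before the spill.
    static List<Map.Entry<String, Integer>> sortAndCombine(List<String> buffer) {
        Collections.sort(buffer);                    // sort inside the buffer
        LinkedHashMap<String, Integer> combined = new LinkedHashMap<>();
        for (String k : buffer) combined.merge(k, 1, Integer::sum);
        return new ArrayList<>(combined.entrySet()); // one record per distinct key
    }

    public static void main(String[] args) {
        List<String> buffer = new ArrayList<>(List.of("b", "a", "b", "a", "a", "c"));
        // Six raw records shrink to three combined records.
        System.out.println(sortAndCombine(buffer));  // [a=3, b=2, c=1]
    }
}
```

The point of the sketch is only that the combiner runs while the sorted buffer is still in memory, so the spill file (and everything downstream) is already smaller.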

許願樹丅啲祈禱 2024-12-18 08:53:18


There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is Tom White's "Hadoop: The Definitive Guide" - https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort)

The first opportunity comes on the map side after completing the in-memory sort by key of each partition, and before writing those sorted data to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. By running the Combiner here, we also reduce the amount of data that will need to be merged and sorted in the next step. So to the original question posted, yes, the Combiner is already being applied at this early step.

The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data to be processed by this step.
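The two opportunities can be sketched as a toy model (assumed WordCount-style sum combiner; not actual Hadoop code): each spill is combined before it hits disk, and the combiner runs once more when the spill files are merged into the final map output.

```java
import java.util.*;

public class TwoCombinePasses {
    // Sum combiner over (word, count) records; TreeMap keeps the output
    // sorted by key, like a sorted spill file.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        TreeMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : records)
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        // First opportunity: each in-memory buffer is combined before spilling.
        Map<String, Integer> spill1 =
            combine(List.of(Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1)));
        Map<String, Integer> spill2 =
            combine(List.of(Map.entry("a", 1), Map.entry("c", 1)));

        // Second opportunity: the spill files are merged and combined again,
        // so the reducer-bound output holds one record per key.
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        merged.addAll(spill1.entrySet());
        merged.addAll(spill2.entrySet());
        System.out.println(combine(merged)); // {a=3, b=1, c=1}
    }
}
```

Note how the second pass operates on already-shrunken data, which is exactly the benefit the answer describes.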

往事风中埋 2024-12-18 08:53:18


The combiner only runs in the way you already understand it.

I suspect the reason that the combiner only works in this way is that it reduces the amount of data being sent to the reducers. This is a huge gain in many situations. Meanwhile, in the reducer, the data is already there, and whether you combine them in the sort/merge or in your reduce logic is not really going to matter computationally (it's either done now or later).

So, I guess my point is: you might see some gain from combining during the merge, as you suggest, but it would not be as large as the gain from the map-side combiner.

旧时模样 2024-12-18 08:53:18


I haven't gone through the code, but Hadoop: The Definitive Guide by Tom White (3rd edition) does mention that if a combiner is specified, it will run during the merge phase on the reducer side. The following is an excerpt from the text:

"The map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk."
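The property names in the excerpt are the classic MR1 names; the arithmetic behind them is simple. The heap size and percentage values below are illustrative assumptions, not recommended settings:

```java
public class ShuffleBufferMath {
    // Fraction of a byte budget, as the shuffle properties compute it.
    static long fraction(long bytes, double pct) {
        return (long) (bytes * pct);
    }

    public static void main(String[] args) {
        long reduceHeap = 1024L * 1024 * 1024;  // assume a 1 GiB reduce task heap
        double inputBufferPercent = 0.70;       // mapred.job.shuffle.input.buffer.percent
        double mergePercent = 0.66;             // mapred.job.shuffle.merge.percent

        long shuffleBuffer = fraction(reduceHeap, inputBufferPercent);
        long mergeTrigger = fraction(shuffleBuffer, mergePercent);

        // Once buffered map outputs exceed mergeTrigger bytes (or the
        // mapred.inmem.merge.threshold count), they are merged and spilled,
        // and the combiner -- if one is set -- runs during that merge.
        System.out.printf("shuffle buffer: %d bytes, merge triggers at %d bytes%n",
                shuffleBuffer, mergeTrigger);
    }
}
```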
