Hadoop combiner and the sort phase
When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run during intermediate steps when merge sorting. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.
If this doesn't currently happen, is there a particular reason, or is it just something that hasn't been implemented?
Thanks in advance!
Comments (4)
Combiners are there to save network bandwidth.
The map output is sorted directly, right after the actual mapping is done. While iterating through the buffer, the task checks whether a combiner has been set; if so, it combines the records before spilling them, and if not, it spills them straight to disk.
The important parts are in MapTask, if you'd like to see it for yourself. This is the right stage to save disk space and network bandwidth, because it is very likely that the output has to be transferred.
During the merge/shuffle/sort phase it is not beneficial, because by then you have to crunch more data than the combiner does when it runs at map finish time.
Note that the sort phase shown in the web interface is misleading. It is just pure merging.
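To make the spill step above concrete, here is a minimal Python simulation. It is not Hadoop's MapTask code: `sort_and_spill` and `sum_combiner` are hypothetical names, and the "spill" is just a returned list. It only illustrates the check described above: sort the buffer, then combine per key if a combiner is set, otherwise pass the sorted records through unchanged.

```python
from itertools import groupby
from operator import itemgetter


def sum_combiner(key, values):
    """Hypothetical combiner: collapse all values for a key into one sum."""
    return [(key, sum(values))]


def sort_and_spill(buffer, combiner=None):
    """Sort buffered (key, value) records and return what one spill would hold."""
    records = sorted(buffer, key=itemgetter(0))
    if combiner is None:
        return records  # no combiner set: the sorted records are spilled as-is
    spilled = []
    for key, group in groupby(records, key=itemgetter(0)):
        # Combiner set: each run of equal keys is reduced before spilling.
        spilled.extend(combiner(key, [value for _, value in group]))
    return spilled


buffer = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]
plain_spill = sort_and_spill(buffer)
combined_spill = sort_and_spill(buffer, combiner=sum_combiner)
print(len(plain_spill), len(combined_spill))
```

With the combiner set, the spill holds one record per key instead of one per input record, which is where the disk and bandwidth savings come from.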
There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is Tom White's "Hadoop: The Definitive Guide": https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort )
The first opportunity comes on the map side, after the in-memory sort by key of each partition is complete and before the sorted data is written to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. Running the Combiner here also reduces the amount of data that will need to be merged and sorted in the next step. So, to answer the original question: yes, the Combiner is already being applied at this early step.
The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data it has to process.
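The two opportunities can be sketched as a small Python simulation. This is an illustration only, not Hadoop's implementation: `spill`, `merge_spills`, and `combine_sorted` are hypothetical names, spill files are modelled as in-memory lists, and the combiner is assumed to be a simple sum.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def combine_sorted(records):
    """Hypothetical sum combiner over records that are already sorted by key."""
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def spill(buffer):
    """First opportunity: sort an in-memory buffer, combine, then 'write' it."""
    return list(combine_sorted(sorted(buffer, key=itemgetter(0))))


def merge_spills(spills):
    """Second opportunity: merge the sorted spills and combine once more."""
    merged = heapq.merge(*spills, key=itemgetter(0))
    return list(combine_sorted(merged))


spill_1 = spill([("a", 1), ("b", 1), ("a", 1)])
spill_2 = spill([("b", 1), ("c", 1), ("b", 1)])
map_output = merge_spills([spill_1, spill_2])
print(spill_1, spill_2, map_output)
```

Each spill is already smaller than its buffer, and the merge step removes the remaining duplicates that were spread across different spills (key "b" here).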
The combiner only runs the way you already understand it.
I suspect the reason the combiner only works this way is that its purpose is to reduce the amount of data sent to the reducers. This is a huge gain in many situations. Meanwhile, in the reducer, the data is already there, and whether you combine it during the sort/merge or in your reduce logic is not really going to matter computationally (it's either done now or later).
So, I guess my point is: you may get some gain by combining during the merge as you suggest, but it's not going to be as much as you get from the map-side combiner.
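A rough way to see this argument is to count the records that cross the network with and without a map-side combiner. The Python sketch below is a toy model with made-up data; `combine` and `reduce_all` are hypothetical helpers, and it assumes a sum, which is safe to apply in stages because addition is associative and commutative.

```python
from collections import defaultdict


def combine(map_output):
    """Hypothetical map-side sum combiner: one record per key per map task."""
    totals = defaultdict(int)
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())


def reduce_all(shuffled):
    """Reducer: sum every value received for each key."""
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)


# Made-up output of two map tasks.
map_outputs = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

shuffled_plain = [rec for out in map_outputs for rec in out]
shuffled_combined = [rec for out in map_outputs for rec in combine(out)]
print(len(shuffled_plain), len(shuffled_combined))
```

The reducer produces the same totals either way; the only difference is how many records had to be transferred to get there.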
I haven't gone through the code, but Hadoop: The Definitive Guide by Tom White (3rd edition) does mention that if a combiner is specified, it will run during the merge phase in the reducer. The following is an excerpt from the text:
"The map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk."