MapReduce - what is the benefit of the word count example?

Published 2024-12-04 17:28:57 · 426 characters · 1 view · 0 comments

I am trying to understand what is the benefit of MapReduce, I have just read some introductions on it for the first time.

They all use this canonical example of counting words in a large set of documents, but I am not seeing the benefit. The following is my current understanding, correct me if I'm wrong.

We specify a list of input files (documents). The MapReduce library takes this list and divides it between the processors in the cluster. Each document at a processor is passed to the map function, which returns a list of pairs in this case.

Here is where I am a little unsure what exactly happens.
Then the library software searches through the set of results on all the different processors, and groups together those pairs with the same word (key). These groups are collected at different processors, and reduce is called on each group at that processor.

Combined results are then collected on the master node.
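The flow described above can be sketched as a single-machine simulation: map emits a (word, 1) pair per word, the library groups pairs by key (the shuffle), and reduce sums each group. The function names here are illustrative, not any particular MapReduce library's API.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Reduce phase: combine all counts observed for one word.
    return (word, sum(counts))

def word_count(documents):
    # "Shuffle" phase: group the intermediate pairs by key (word).
    # In a real cluster this grouping happens across nodes.
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            groups[word].append(count)
    # One reduce call per distinct word.
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```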

Is this the correct interpretation?

What I don't understand is, as it's necessary to sort through all the results to group keys, why not just count the keys it finds at the same time, why is reduce needed at all? How does this process save time when it seems like there is a lot of work to find and combine common keys?


Comments (1)

森罗 2024-12-11 17:28:57

Here is a nice video in YouTube Video on MapReduce algorithm, if you watch the complete series of 5 videos it will give you much more clarity on MapReduce and answer most of your queries.

What I don't understand is, as it's necessary to sort through all the results to group keys, why not just count the keys it finds at the same time, why is reduce needed at all? How does this process save time when it seems like there is a lot of work to find and combine common keys?

Because key/value pair for a particular word like "sample" from the word count example might be emitted by different map tasks and will be distributed across different nodes, these key/value pairs need to be consolidated/sorted before sending to the reduce task. Reduce task for a particular key runs on a single node and are not distributed.
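The point above can be sketched by simulating two map tasks on different nodes: pairs for "sample" appear in both partial outputs, so they must be consolidated before the single reduce call for that key can produce a correct total (an illustrative toy, not a real cluster):

```python
from collections import defaultdict

# Output of two independent map tasks running on different nodes.
map_task_1 = [("sample", 1), ("data", 1), ("sample", 1)]
map_task_2 = [("sample", 1), ("word", 1)]

# Shuffle/sort: consolidate pairs by key across all map outputs.
shuffled = defaultdict(list)
for pair_list in (map_task_1, map_task_2):
    for word, count in pair_list:
        shuffled[word].append(count)

# The reduce call for each key now sees every count for that key,
# which is why it can run on a single node per key.
totals = {word: sum(counts) for word, counts in shuffled.items()}
print(totals)  # {'sample': 3, 'data': 1, 'word': 1}
```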

FYI, the results from the map task are combined using the combiner class (which is the same as the reducer class) on the same node as the map task to decrease the network chatter between the mappers and the reducers.
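The combiner optimization can be sketched as running the reducer's summing logic locally on one node's map output before anything crosses the network (a hedged sketch; in Hadoop this corresponds to configuring the reducer class as the combiner):

```python
from collections import defaultdict

def combine(map_output):
    # Apply the reducer's summing logic locally, before the shuffle,
    # so fewer pairs are sent over the network to the reduce nodes.
    local = defaultdict(int)
    for word, count in map_output:
        local[word] += count
    return list(local.items())

# Without a combiner this node would send 3 pairs; with one it sends 2.
map_task_output = [("sample", 1), ("data", 1), ("sample", 1)]
print(combine(map_task_output))  # [('sample', 2), ('data', 1)]
```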
