How many results does each node need to return in parallel processing?
I have a dataset (in the form of a file) composed of lines of words. I want to find the 20 most frequently occurring words. The dataset is huge, so I am processing it in parallel: I partition the dataset into subsets and feed one to each parallel worker. Each worker then counts the words in its subset and returns a list of its most frequent words with their counts. All the lists are then aggregated, and the top 20 most frequent words across the entire dataset are compiled from the workers' results.
How many word/count pairs does each worker need to return to the aggregator in order to guarantee that I get the top 20 words out of the entire dataset?
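To make the setup concrete, here is a minimal sketch of the scheme described above, in Python. The names worker_top_k and aggregate are illustrative, and k is exactly the unknown quantity the question asks about.

```python
from collections import Counter

def worker_top_k(lines, k):
    """Each worker counts the words in its partition and returns
    only its local top-k word/count pairs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common(k)

def aggregate(per_worker_results):
    """Merge the truncated per-worker lists and take the overall
    top 20. Whether this is exact depends on k -- the question."""
    merged = Counter()
    for pairs in per_worker_results:
        merged.update(dict(pairs))
    return merged.most_common(20)
```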
2 Answers
You need to keep processing words until the difference in counts between the 20th and 21st most frequent words is greater than the number of words remaining unprocessed.
If you need to rank the top 20 most frequent words, then you need to process everything.
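Here is a small sketch of that stopping test, assuming the aggregator keeps running counts and knows how many words are still unprocessed (the function name is illustrative):

```python
from collections import Counter

def top20_is_settled(counts: Counter, words_remaining: int) -> bool:
    """True once no unprocessed word can change the top-20 set:
    even if every remaining word were the current 21st-place word,
    it could not overtake 20th place."""
    ranked = counts.most_common(21)
    if len(ranked) < 21:
        # Fewer than 21 distinct words seen so far; a new word could
        # still enter the top 20, so only an empty remainder settles it.
        return words_remaining == 0
    gap = ranked[19][1] - ranked[20][1]  # 20th count minus 21st count
    return gap > words_remaining
```

Note that this only fixes the membership of the top-20 set; as the second point says, guaranteeing the ranking within it requires processing everything.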
It's impossible to say. Consider, for example, the possibility that you have four workers and worker 1's word frequencies are the inverse of the other three workers' frequencies. In order to get an accurate count, worker 1 will have to return its entire data set so that it can be aggregated.
In the general case, each worker has to return all of its data to the master node for aggregation.
What you're doing is a pretty typical map-and-reduce problem; a good discussion can be found under "MapReduce" or in "Map/Reduce - A visual explanation".
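A minimal sketch of that general case, using Python's multiprocessing and collections.Counter; the toy partitions at the bottom are illustrative, and in practice each would be a chunk of the file:

```python
from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    """Map step: one worker builds the full count table for its partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts  # the complete table, not a truncated top-k

def top20(partitions):
    """Reduce step: sum the per-worker tables, then take the top 20."""
    with Pool() as pool:
        partials = pool.map(count_words, partitions)
    return sum(partials, Counter()).most_common(20)

if __name__ == "__main__":
    parts = [["a b a", "c a"], ["b b c", "a"]]
    print(top20(parts))
```

Returning the complete tables is what makes the merged counts exact; any fixed truncation can be defeated by an adversarial partition like the four-worker example above.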