How to compare Hadoop results
I am writing a MapReduce program to find the file that contains the most words.
Now I am able to use MapReduce to find the number of words contained in each file. However, I am unsure how to store the word count for each file, compare the counts, and then find the file with the most words, all using MapReduce.
My idea so far:
Have a job (or several) find the number of words in each file, like this:
file_name | number of words
file_1 5
file_2 10
file_3 15
then start another job whose reducer finds the maximum number of words,
finally producing the result:
file_3
I wonder: does this approach make sense? Is there another way to find the file that contains the most words via MapReduce?
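The two-job plan can be sanity-checked outside Hadoop. Below is a plain-Python sketch (the file names and contents are made up for illustration, not from any real job): stage one produces the per-file word counts, and stage two reduces them to the single file with the largest count.

```python
# Plain-Python sanity check of the two-job plan:
# job 1 yields (file_name, word_count) pairs,
# job 2 reduces them to the file with the largest count.
def job1_word_counts(files):
    # files: dict of file_name -> file contents
    return {name: len(text.split()) for name, text in files.items()}

def job2_max(counts):
    # counts: dict of file_name -> word count
    return max(counts, key=counts.get)

files = {
    "file_1": "one two three four five",          # 5 words
    "file_2": "a b c d e f g h i j",              # 10 words
}
counts = job1_word_counts(files)
print(job2_max(counts))  # → file_2
```

In real Hadoop, job 1's output files on HDFS would become job 2's input, but the data flow is the same.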
Comments (1)
Finding a minimum/maximum isn't a great use case for MapReduce, because you need to force all the data to a single reducer.
For example, iterate over the input and write each record from the mapper under a single constant key.
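A minimal Hadoop Streaming mapper sketch in Python illustrating that step (the constant key name "max" and the tab delimiter are my assumptions, not anything from the question):

```python
# Streaming mapper sketch: re-key every "file_name\tcount" record from the
# first job under one constant key so every record is routed to the same
# reducer.
def map_lines(lines):
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        out.append("max\t" + line)  # emit: constant_key \t file_name \t count
    return out

# In a real streaming job, `lines` would be sys.stdin and each record
# would be printed; here we just show the transformation:
print(map_lines(["file_1\t5", "file_2\t10"]))
```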
Then, iterate over the values in the reducer and find the maximum, just as you would for any array. You'll need to split on the delimiter so you have both the filename and the word count.
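A matching reducer sketch, under the same assumptions (constant key "max", tab-delimited records):

```python
# Streaming reducer sketch: every record shares the constant key, so we
# scan them all, split on the tab delimiter, and keep the running maximum.
def find_max(lines):
    best_file, best_count = None, -1
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue
        _, file_name, count = parts  # constant_key, file_name, word count
        if int(count) > best_count:
            best_file, best_count = file_name, int(count)
    return best_file

# In a real streaming job, the records would come from sys.stdin:
print(find_max(["max\tfile_1\t5", "max\tfile_2\t10", "max\tfile_3\t15"]))  # → file_3
```

Since everything funnels through one reducer, this second job does no parallel work; it only exists to collect the per-file counts in one place.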