How to compare Hadoop results
I am writing a MapReduce program to find the file that contains the most words.
Now I am able to use MapReduce to find the number of words contained in each file. However, I am unsure how to store the word count for each file, compare the counts, and then find the file with the most words, all using MapReduce.
My idea so far:
Have a job (or several) find the number of words in each file, like this:
file_name | number of words
file_1 5
file_2 10
file_3 15
then start another job whose reducer finds the maximum number of words,
finally producing the result:
file_3
I wonder: does this approach make sense? Is there another way to find the file that contains the most words via MapReduce?
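The two-job plan can be sanity-checked outside Hadoop. Below is a plain-Python sketch (the file names and contents are made up for illustration, not from any real job): stage one produces the per-file word counts, and stage two reduces them to the single file with the largest count.

```python
# Plain-Python sanity check of the two-job plan:
# job 1 yields (file_name, word_count) pairs,
# job 2 reduces them to the file with the largest count.
def job1_word_counts(files):
    # files: dict of file_name -> file contents
    return {name: len(text.split()) for name, text in files.items()}

def job2_max(counts):
    # counts: dict of file_name -> word count
    return max(counts, key=counts.get)

files = {
    "file_1": "one two three four five",          # 5 words
    "file_2": "a b c d e f g h i j",              # 10 words
}
counts = job1_word_counts(files)
print(job2_max(counts))  # → file_2
```

In real Hadoop, job 1's output files on HDFS would become job 2's input, but the data flow is the same.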
Comments (1)
Finding a minimum/maximum isn't a great use case for MapReduce, because you need to force all the data to a single reducer.
For example, iterate over the input and write each record from the mapper under a single constant key.
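A minimal Hadoop Streaming mapper sketch in Python illustrating that step (the constant key name "max" and the tab delimiter are my assumptions, not anything from the question):

```python
# Streaming mapper sketch: re-key every "file_name\tcount" record from the
# first job under one constant key so every record is routed to the same
# reducer.
def map_lines(lines):
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        out.append("max\t" + line)  # emit: constant_key \t file_name \t count
    return out

# In a real streaming job, `lines` would be sys.stdin and each record
# would be printed; here we just show the transformation:
print(map_lines(["file_1\t5", "file_2\t10"]))
```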
Then, iterate over the values in the reducer and find the maximum, just as you would for any array. You'll need to split on the delimiter so you have both the filename and the word count.
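A matching reducer sketch, under the same assumptions (constant key "max", tab-delimited records):

```python
# Streaming reducer sketch: every record shares the constant key, so we
# scan them all, split on the tab delimiter, and keep the running maximum.
def find_max(lines):
    best_file, best_count = None, -1
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue
        _, file_name, count = parts  # constant_key, file_name, word count
        if int(count) > best_count:
            best_file, best_count = file_name, int(count)
    return best_file

# In a real streaming job, the records would come from sys.stdin:
print(find_max(["max\tfile_1\t5", "max\tfile_2\t10", "max\tfile_3\t15"]))  # → file_3
```

Since everything funnels through one reducer, this second job does no parallel work; it only exists to collect the per-file counts in one place.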