How to re-run the whole map/reduce in Hadoop before the job finishes?
I am using Hadoop Map/Reduce with Java.
Suppose I have completed a whole map/reduce job. Is there any way I could repeat just the map/reduce part, without ending the job? I mean, I DON'T want to use any chaining of different jobs, but only want the map/reduce part to repeat.
Thank you!
Comments (1)
So I am more familiar with the Hadoop streaming API, but the approach should translate to the native API.
In my understanding, what you are trying to do is run several iterations of the same map() and reduce() operations on the input data.
Let's say your initial map() input data comes from the file input.txt and the output file is output + {iteration}.txt (where iteration is the loop count, iteration = [0, # of iterations)).
In the second invocation of map()/reduce(), your input file is output + {iteration} and the output file becomes output + {iteration + 1}.txt.
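To see that data flow without a cluster, here is a minimal in-memory sketch (plain Java, no Hadoop; word count is just a stand-in for your actual map/reduce, and all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IterationDemo {
    // Stand-in map(): emit a (word, 1) pair for every token in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Stand-in reduce(): sum counts per key, emit one "key count" line each.
    static List<String> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet())
            out.add(e.getKey() + " " + e.getValue());
        return out;
    }

    public static void main(String[] args) {
        List<String> data = List.of("a b a", "b c"); // plays the role of input.txt
        int iterations = 2;
        for (int i = 0; i < iterations; i++) {
            // Iteration i's output becomes iteration i+1's input,
            // just like output{i}.txt feeding the next pass.
            data = reduce(map(data));
            System.out.println("after iteration " + i + ": " + data);
        }
    }
}
```

The point is only the loop shape: each pass consumes the previous pass's output, with no second job chained in.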
Let me know if this is not clear; I can conjure up a quick example and post a link here.
EDIT: For Java, I modified the Hadoop wordcount example to run multiple times.
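A driver for that might look roughly like the sketch below. This is an assumption-laden outline, not the exact modified wordcount: IterativeDriver, MyMapper, MyReducer, the key/value types, and the path names are all placeholders, and it requires the Hadoop mapreduce (v2) API on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        int iterations = 3;          // assumed fixed pass count
        String input = "input.txt";  // initial map() input
        for (int i = 0; i < iterations; i++) {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pass-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(MyMapper.class);    // placeholder mapper
            job.setReducerClass(MyReducer.class);  // placeholder reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(input));
            String output = "output" + i + ".txt";
            FileOutputFormat.setOutputPath(job, new Path(output));
            // Block until this pass finishes; abort if it fails.
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            input = output; // next pass reads this pass's output
        }
    }
}
```

Note that each pass is a fresh Job object; Hadoop jobs are not re-runnable once submitted, so "repeating the map/reduce part" in practice means resubmitting the same mapper/reducer with shifted paths like this.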
Hope this helps