MapReduce sort iterator
I am reading the MapReduce source code to better understand its internal mechanism, and I am having trouble understanding how the data produced in the map phase is merged and sent to the reduce function for further processing. The source code looks too complicated, and I just want to understand the concepts.
What I want to know is how the values (the Iterator parameter) are sorted before they are passed to the reduce() function. ReduceTask.runOldReducer() creates a ReduceValuesIterator from a RawKeyValueIterator, which is where Merger.merge() gets called and a lot of work is done (e.g. collecting segments). After reading the code, it seems to me that it only sorts by key, and that the values accompanying each key are aggregated/collected without any being removed. For instance, map() may produce
Key                             Value
http://www.abcfood.com/aLink    object A
http://www.abcfood.com/bLink    object B
http://www.abcfood.com/cLink    object C
Then in reduce(),
the key will be http://www.abcfood.com/ and the values will contain object A, object B, and object C.
So is it sorted by the key http://www.abcfood.com/? Is this correct? If not, what is it sorted by before being passed to the reduce function?
Many thanks.
Comments (2)
Assuming this is your input:
the reducer will get this (there is no guarantee on the order of the values):
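The grouping behavior described in this answer can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name `ShuffleSketch`, the method `sortAndGroup`, and the sample records are hypothetical, and the real framework merges sorted on-disk segments rather than filling an in-memory map. The point is that records are ordered and grouped by key, while the values inside a group are not sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the sort/group step between map() and reduce().
public class ShuffleSketch {

    /**
     * Groups (key, value) pairs by exact key. The TreeMap keeps the keys in
     * sorted order; the values of each key are simply appended as they arrive.
     */
    static Map<String, List<String>> sortAndGroup(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up map output: two records share a key, one has its own key.
        List<String[]> mapOutput = Arrays.asList(
            new String[] {"http://www.abcfood.com/bLink", "object B"},
            new String[] {"http://www.abcfood.com/aLink", "object A1"},
            new String[] {"http://www.abcfood.com/aLink", "object A2"});

        // One simulated reduce() call per distinct key, keys in sorted order.
        // Value order here is arrival order; in a real job it is unspecified.
        sortAndGroup(mapOutput).forEach((key, values) ->
            System.out.println("reduce(" + key + ", " + values + ")"));
    }
}
```

In this sketch, keys that differ in any character end up in separate groups, so with the default behavior a reduce() call only sees values whose keys compare as equal.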
So is there any way to get ordered values in the reducer?
I need to work with sorted values (to calculate the difference between the values passed with a key). I have also run into this problem :)
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
I understand that it is bad to COPY the values in the reducer and then sort them, because I could run out of memory. It would be better to sort the values in some way BEFORE the KEY + Iterable are passed to the reducer.
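The idea of sorting before the reducer sees the values is usually called a "secondary sort" in Hadoop: the value is folded into a composite key so the framework's key sort also orders the values, while partitioning and grouping still use only the natural key. The plain-Java sketch below imitates that ordering without Hadoop classes. The names `SecondarySortSketch`, `Record`, and `consecutiveDifferences` are hypothetical, and the in-memory sort only stands in for the framework's shuffle sort; it is not how a real job would be configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the secondary-sort idea; no Hadoop classes involved.
public class SecondarySortSketch {

    /** Composite record: the natural key plus the value that drives the ordering. */
    static final class Record {
        final String key;
        final long value;

        Record(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Sorts by (key, value), then walks the sorted stream once. Because the
     * values of each key arrive in ascending order, the difference between
     * neighbors can be computed while keeping only the previous record.
     */
    static Map<String, List<Long>> consecutiveDifferences(List<Record> mapOutput) {
        List<Record> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Comparator.comparing((Record r) -> r.key)
                              .thenComparingLong(r -> r.value));

        Map<String, List<Long>> result = new LinkedHashMap<>();
        Record previous = null;
        for (Record current : sorted) {
            List<Long> diffs = result.computeIfAbsent(current.key, k -> new ArrayList<>());
            if (previous != null && previous.key.equals(current.key)) {
                diffs.add(current.value - previous.value);
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<Record> mapOutput = Arrays.asList(
            new Record("a", 10), new Record("b", 5),
            new Record("a", 3), new Record("a", 7));
        System.out.println(consecutiveDifferences(mapOutput));
    }
}
```

In a real job the same effect is normally obtained by registering a sort comparator, a grouping comparator, and a partitioner on the Job (setSortComparatorClass, setGroupingComparatorClass, setPartitionerClass), so that the reducer can stream through already ordered values instead of copying them into memory.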