MapReduce sorting iterator

Posted on 2024-12-09 13:38:52

I am reading the MapReduce source code to better understand its internal mechanisms, and I am having trouble understanding how the data produced in the map phase is merged and sent to the reduce function for further processing. The source code is quite complicated, and I just want to understand the concepts.

What I want to know is how the values (passed as the Iterator parameter) are sorted before being passed to the reduce() function. ReduceTask.runOldReducer() creates a ReduceValuesIterator from a RawKeyValueIterator, which is where Merger.merge() gets called and many actions are performed (e.g. collecting segments). After reading the code, it seems to me that it only sorts by key, and the values accompanying that key are aggregated/collected without being removed. For instance, map() may produce

    Key                              Value
    http://www.abcfood.com/aLink     object A
    http://www.abcfood.com/bLink     object B
    http://www.abcfood.com/cLink     object C

Then in reduce(),

Key will be http://www.abcfood.com/ and Values will contain object A, object B, and object C.

So is it sorted by the key http://www.abcfood.com/? Is this correct? If not, what is the data sorted by before it is passed to the reduce function?

Many thanks.

Comments (2)

椒妓 2024-12-16 13:38:52

Assuming this is your input:

    Key                              Value
    http://www.example.com/asd       object A
    http://www.abcfood.com/aLink     object A
    http://www.abcfood.com/bLink     object B
    http://www.abcfood.com/cLink     object C
    http://www.example.com/t1        object X

Assuming the records are grouped by site, as in your example, the reducer will get this (there is no guarantee on the order of the values):

    Key                              Values
    http://www.abcfood.com/          [ "object A", "object C", "object B" ]
    http://www.example.com/          [ "object X", "object A" ]
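
To make the concept concrete, here is a minimal Python sketch of the idea. It is not Hadoop code and only simulates the behaviour described above: each map output segment is sorted by key, the segments are merged by key, and all values that share a key are handed to one reduce call. The names `site_key`, `segment_1`, `segment_2` and `shuffle` are hypothetical, and the sketch assumes the mapper emits the site prefix (scheme and host) as its key.

```python
import heapq
from itertools import groupby
from urllib.parse import urlsplit


def site_key(url):
    """Hypothetical map-side key: scheme and host only."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def by_key(record):
    """Compare records on the key only; values never take part in the sort."""
    return record[0]


# Two hypothetical map output segments, each sorted by key only.
segment_1 = sorted([
    (site_key("http://www.example.com/asd"), "object A"),
    (site_key("http://www.abcfood.com/aLink"), "object A"),
    (site_key("http://www.abcfood.com/cLink"), "object C"),
], key=by_key)
segment_2 = sorted([
    (site_key("http://www.abcfood.com/bLink"), "object B"),
    (site_key("http://www.example.com/t1"), "object X"),
], key=by_key)


def shuffle(*segments):
    """Merge key-sorted segments and group the values of equal keys."""
    merged = heapq.merge(*segments, key=by_key)
    for key, group in groupby(merged, key=by_key):
        # The order inside each list depends on how the segments happen
        # to interleave, so a reducer should not rely on it.
        yield key, [value for _, value in group]


result = dict(shuffle(segment_1, segment_2))
for key, values in result.items():
    print(key, values)
```

The keys come out in sorted order, while the order of the values under each key is just a side effect of which segment a record came from.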

罗罗贝儿 2024-12-16 13:38:52

So is there any way to get ordered values in the reducer?
I need to work with sorted values (to calculate the difference between the values passed with a key). I've run into this problem :)
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/

I understand that it's bad to copy the values in the reducer and then sort them, because I could run out of memory. It would be better to sort the values in some way before passing the key + Iterable to the reducer.
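
One common approach to this is usually called "secondary sort": put the value into a composite key, let the framework sort on the whole composite key, and group on the original (natural) key only. In Hadoop this is typically wired up with a custom partitioner, sort comparator and grouping comparator (`Job.setPartitionerClass`, `Job.setSortComparatorClass`, `Job.setGroupingComparatorClass`). The Python sketch below is not Hadoop code and only simulates the idea; `map_output`, `secondary_sort` and `reduce_differences` are hypothetical names, and the numeric values are made up for illustration.

```python
from itertools import groupby

# Hypothetical map output: (natural key, numeric value) in arbitrary order.
map_output = [
    ("http://www.abcfood.com/", 30),
    ("http://www.example.com/", 7),
    ("http://www.abcfood.com/", 10),
    ("http://www.example.com/", 2),
    ("http://www.abcfood.com/", 25),
]


def reduce_differences(values):
    """Yield differences between consecutive values.

    Only the previous value is kept, so the values are never buffered.
    """
    previous = None
    for value in values:
        if previous is not None:
            yield value - previous
        previous = value


def secondary_sort(records):
    """Sort on (key, value), then group on the natural key only."""
    # Stand-in for the framework's sort phase on the composite key.
    # (Done in memory here only because this is a toy simulation.)
    ordered = sorted(records)
    # Stand-in for a grouping comparator that looks at the natural key.
    for key, group in groupby(ordered, key=lambda record: record[0]):
        values_in_order = (value for _, value in group)
        yield key, list(reduce_differences(values_in_order))


differences = dict(secondary_sort(map_output))
for key, diffs in differences.items():
    print(key, diffs)
```

Because the values already arrive in order, the reduce step can work through them one at a time instead of copying them into a list and sorting there.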
