Parallelizing "Reduce" in "MapReduce"
I understand how Map is easily parallelizable - each computer/CPU can just operate on a small portion of the array.
Is Reduce/foldl parallelizable? It seems like each computation depends on the previous one. Is it just parallelizable for certain types of functions?
6 Answers
If the operation underlying your reduction is associative*, you can play with the order of operations and with locality. You therefore often get a tree-like structure in the 'gather' phase, so you can do it in several passes, in logarithmic time:
instead of (((a+b)+c)+d), you compute ((a+b)+(c+d)), where (a+b) and (c+d) can be evaluated in parallel.
If your operation is also commutative, further optimizations are possible, since you can gather in a different order (this may be important for data alignment when those operations are vector operations, for example).
[*] Associative for the mathematical operation you actually want, not as implemented on machine types like floats, of course; floating-point addition is not strictly associative.
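A minimal sketch of that tree-shaped gather (my illustration, not from the original answer; it assumes an associative binary operation op and a non-empty list, and Python's thread pool is an arbitrary choice):

    import operator
    from concurrent.futures import ThreadPoolExecutor

    def tree_reduce(op, values):
        # Combine adjacent pairs level by level; the pairs within one
        # level are independent, so each level can run in parallel.
        # (Threads only illustrate the structure; CPU-bound Python work
        # would want processes or another runtime.)
        while len(values) > 1:
            pairs = [(values[i], values[i + 1])
                     for i in range(0, len(values) - 1, 2)]
            with ThreadPoolExecutor() as pool:
                combined = list(pool.map(lambda ab: op(*ab), pairs))
            if len(values) % 2:           # odd element carries over
                combined.append(values[-1])
            values = combined
        return values[0]

    print(tree_reduce(operator.add, [1, 2, 3, 4, 5, 6, 7, 8]))  # prints 36

With n elements this takes about log2(n) passes, matching the logarithmic time claimed above.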
Yes, if the operator is associative. For example, you can parallelise summing a list of numbers:
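(The worked example itself is missing from the page; the pairwise summing below is a plausible reconstruction of what it showed:)

    step 0: [1, 2, 3, 4, 5, 6, 7, 8]
    step 1: [3, 7, 11, 15]     (adjacent pairs summed in parallel)
    step 2: [10, 26]
    step 3: [36]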
This works because (a+b)+c = a+(b+c), i.e. the order in which the additions are performed doesn't matter.
Check out the combine phase in Hadoop
http://wiki.apache.org/hadoop/HadoopMapReduce
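(As context, not part of the original answer: Hadoop's combiner is essentially a map-side, local reduce that pre-aggregates values before the shuffle, which is safe exactly when the operation is associative and commutative. Schematically:)

    # map output on one node:          [("the", 1), ("cat", 1), ("the", 1)]
    # after the combiner (local sums): [("the", 2), ("cat", 1)]
    # the reducer then merges the pre-combined counts from every node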
Not sure what platform/language you're thinking of, but you can parallelize reduce operators like this:
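(The original code snippet did not survive; the following is a sketch of the idea in Python, with threads chosen purely for illustration and a non-empty input assumed:)

    import threading

    def parallel_reduce(op, values):
        # Base case: small lists are reduced directly.
        if len(values) == 1:
            return values[0]
        if len(values) == 2:
            return op(values[0], values[1])
        # Split the list and reduce each half in its own thread...
        mid = len(values) // 2
        results = [None, None]

        def work(slot, part):
            results[slot] = parallel_reduce(op, part)

        threads = [threading.Thread(target=work, args=(0, values[:mid])),
                   threading.Thread(target=work, args=(1, values[mid:]))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # ...then one more reduce combines the partial results.
        return op(results[0], results[1])

    print(parallel_reduce(lambda a, b: a + b, list(range(1, 9))))  # prints 36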
As you can see, a parallel implementation is easily recursive. You split the map up, operate on each part in its own thread, then perform another reduce once those threads are done to bring the pieces together.
(This is the programmatic reasoning behind Piotr Lesnick's answer.)
Technically, a reduce is not the same as a foldl (fold-left), which can also be described as an accumulate.
The example given by Jules (the pairwise summing shown above) illustrates a reduce operation very well.
Note that at each step the result is an array, including the final result which is an array of one item.
A fold-left is like the following:
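(Again the original example is missing; this is a minimal sketch, where foldl is a hypothetical helper behaving like functools.reduce with an explicit initial accumulator:)

    def foldl(op, acc, values):
        # Each step feeds the previous accumulator into the next call,
        # so the evaluation order is fixed and inherently sequential.
        for v in values:
            acc = op(acc, v)
        return acc

    # With a non-associative operator such as subtraction, the order matters:
    print(foldl(lambda a, b: a - b, 0, [1, 2, 3]))  # ((0 - 1) - 2) - 3 == -6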
Now obviously these both produce the same results, but a foldl has a well-defined result when given a non-associative operator (like subtraction) whereas a reduce operator doesn't.
It depends on your Reduce step. In a Hadoop-style implementation of MapReduce, your Reducer is getting called once per key, with all the rows relevant to that key.
So, for example, your Mapper might be taking in a lot of unordered web server logs, adding some metadata (e.g., geocoding), and emitting [key, record] pairs with a cookie ID as the key. Your Reducer would then be called once per cookie ID and would be fed all the data for that cookie, and could compute aggregate info such as visit frequency or average pages viewed per visit. Or you could key on geocode data, and gather aggregate stats based on geography.
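(A sketch of such a per-key Reducer in the style of Hadoop Streaming; the input format and field names are invented for illustration:)

    import sys
    from itertools import groupby

    def parse(line):
        # assumed input: "cookie_id<TAB>pages_viewed", one record per line
        cookie_id, pages = line.rstrip("\n").split("\t")
        return cookie_id, int(pages)

    def main():
        # Hadoop Streaming delivers the reducer's input sorted by key,
        # so consecutive lines with the same cookie ID form one group.
        records = (parse(line) for line in sys.stdin)
        for cookie_id, group in groupby(records, key=lambda kv: kv[0]):
            pages_per_visit = [pages for _, pages in group]
            visits = len(pages_per_visit)
            avg_pages = sum(pages_per_visit) / visits
            print(f"{cookie_id}\t{visits}\t{avg_pages:.2f}")

    if __name__ == "__main__":
        main()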
Even if you're not doing per-key aggregate analysis - indeed, even if you're computing something over the whole set - it might be possible to break your computation into chunks, each of which could be fed to a Reducer.
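(One illustration of that chunking, not from the original answer: a whole-set sum can be spread across arbitrary buckets, with a tiny second pass to merge them:)

    # map:    record -> (hash(record_id) % 16, value)  # 16 arbitrary bucket keys
    # reduce: each bucket sums its own values          # 16 reducers in parallel
    # final:  sum the 16 bucket totals in one small follow-up reduce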