How are MapReduce tasks kept independent of each other?
I'm curious: how do MapReduce, Hadoop, etc., break a chunk of data into independently operated tasks? I'm having a hard time imagining how that can be, considering it is common to have data that is quite interrelated, with state conditions between tasks, etc.
Answers (1)
If the data IS related, it is your job to ensure that the information is passed along. MapReduce breaks up the data and processes it regardless of any (not implemented) relations:
Map just reads data in blocks from the input files and passes it to the map function one "record" at a time. The default record is a line (but this can be modified).
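To make that concrete, here is a minimal sketch (not the actual Hadoop API) of how a framework might feed records to a mapper: the input is split into lines, and the map function is called once per record with no shared state between calls. The word-count mapper and the input text are illustrative assumptions.

```python
def map_fn(record):
    # A trivial mapper: emit each word of the record with a count of 1.
    for word in record.split():
        yield (word, 1)

# Simulated input block: the framework calls map_fn once per line
# (the default record), independently for each record.
block = "the quick fox\nthe lazy dog"
pairs = [kv for line in block.split("\n") for kv in map_fn(line)]
# pairs contains six (word, 1) tuples, e.g. ("the", 1) appears twice.
```

Because each call to `map_fn` sees only its own record, the framework is free to run the calls on different machines in any order.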
You can annotate the data in Map with its origin, but what you basically do with Map is categorize the data. You emit a new key and new values, and MapReduce groups by the new key. So if there are relations between different records: choose the same (or similar *1) key when emitting them, so they are grouped together.
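A small sketch of that idea, using hypothetical "order line" records: records that belong together are made related by emitting them under the same key (here an order id), so the shuffle phase will put them in one group.

```python
def map_order_line(record):
    # Hypothetical record format "order_id,amount": key on order_id so
    # that all lines of the same order are grouped together later.
    order_id, amount = record.split(",")
    yield (order_id, int(amount))

records = ["A,10", "B,5", "A,7"]
emitted = [kv for r in records for kv in map_order_line(r)]
# Both "A" records now carry the same key and will land in one group.
```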
For Reduce, the data is partitioned and sorted (that is where the grouping takes place), and afterwards the reduce function receives all data from one group: one key and all its associated values. Now you can aggregate over the values. That's it.
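The shuffle/sort step followed by a reduce can be sketched like this (a toy in-memory simulation, not the real framework): the emitted pairs are sorted by key, which is where grouping happens, and then the reduce function sees each key together with all of its values.

```python
from itertools import groupby

def reduce_fn(key, values):
    # Aggregate all values of one group, here by summing them.
    return (key, sum(values))

pairs = [("A", 10), ("B", 5), ("A", 7)]
pairs.sort(key=lambda kv: kv[0])               # partition/sort phase
results = [reduce_fn(k, (v for _, v in vs))
           for k, vs in groupby(pairs, key=lambda kv: kv[0])]
# results == [("A", 17), ("B", 5)]
```

Each reduce call is again independent of the others, which is what lets the groups be processed in parallel.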
So you have an overall group-by implemented by MapReduce. Everything else is your responsibility. You want a cross product of two sources? Implement it, for example, by introducing artificial keys and multi-emitting (fragment-and-replicate join). Your imagination is the limit. And: you can always pass the data through another job.
*1: similar, because you can influence the choice of grouping later on. Normally grouping is by the identity function on the key, but you can change this.
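The footnote's point can be sketched as follows (an in-memory toy, not the framework's grouping-comparator API): grouping normally uses the key itself, but a different grouping function, here a case-insensitive one, makes "similar" keys land in one group.

```python
from itertools import groupby

pairs = [("Apple", 1), ("apple", 2), ("Banana", 3)]
group_key = lambda kv: kv[0].lower()   # custom grouping, not identity
pairs.sort(key=group_key)
groups = {k: [v for _, v in vs]
          for k, vs in groupby(pairs, key=group_key)}
# groups == {"apple": [1, 2], "banana": [3]}
```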