How do I do a lookup (or join) with Map-Reduce?
How can I take the input set
{worker-id:1 name:john supervisor-id:3}
{worker-id:2 name:jane supervisor-id:3}
{worker-id:3 name:bob}
and produce the output set
{worker-id:1 name:john supervisor-name:bob}
{worker-id:2 name:jane supervisor-name:bob}
using a "pure" map-reduce framework, i.e. one with only a map phase and a reduce phase but without any extra feature such as CouchDB's lookup?
Comments (2)
The exact details depend on your map-reduce framework, but the idea is this. In the map phase, you emit two types of key/value pairs for each input record: one keyed by the record's own worker-id and tagged as a boss, and one keyed by its supervisor-id and tagged as a worker. For john, that is
(1, {name:john type:boss})
and
(3, {worker-id:1 name:john type:worker})
In the reduce phase you get all of the values for a key grouped together. If there is a record of type boss among them, you remove that record and use its name to populate the supervisor-name of the other records. If there isn't, you drop those records on the floor. Basically, you use the fact that data gets grouped by key and then processed together in the reduce to do the join.
(In some map-reduce implementations, the reduce receives the key/value pairs incrementally. In those implementations you can't throw away records that don't have a boss yet, so you wind up needing map-reduce-reduce, with the extra reduce as that final filtering step.)
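To make the idea concrete, here is a minimal Python sketch of that join. It is not tied to any real framework: `map_record`, `reduce_group`, and `run_job` are hypothetical names, and `run_job` is only a stand-in for the shuffle step a framework would do for you. It also assumes the reduce sees all values for a key at once, not incrementally.

```python
from collections import defaultdict

def map_record(record):
    """Emit (key, value) pairs for one input record."""
    # Every record may be somebody's boss: key it by its own id.
    yield record["worker-id"], {"type": "boss", "name": record["name"]}
    # A record with a supervisor is also a worker: key it by the supervisor's id.
    if "supervisor-id" in record:
        yield record["supervisor-id"], {
            "type": "worker",
            "worker-id": record["worker-id"],
            "name": record["name"],
        }

def reduce_group(key, values):
    """Join the workers grouped under one key with that key's boss record."""
    boss = next((v for v in values if v["type"] == "boss"), None)
    if boss is None:
        return []  # no boss record for this key: drop the workers
    return [
        {"worker-id": v["worker-id"], "name": v["name"],
         "supervisor-name": boss["name"]}
        for v in values
        if v["type"] == "worker"
    ]

def run_job(records):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_group(key, groups[key]))
    return output

workers = [
    {"worker-id": 1, "name": "john", "supervisor-id": 3},
    {"worker-id": 2, "name": "jane", "supervisor-id": 3},
    {"worker-id": 3, "name": "bob"},
]
result = run_job(workers)
```

With the input set from the question, `result` holds the two joined records for john and jane. bob produces no output row because he has no supervisor-id, so he only ever appears as a boss record.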
Is there only one input file, or more?
I mean, is it possible to have a case where one file contains a worker record with a supervisor-id, while the description of that supervisor (the name for that supervisor-id) is in another file?