Join operation with MongoDB MapReduce
I've been using MapReduce to perform the classic MR operation, the equivalent of GROUP BY in SQL.
I was wondering whether it is conceptually possible to perform a JOIN operation with MapReduce. Any idea how that could be implemented? Does it make sense to use MapReduce for this kind of operation?
Thanks!
Comments (2)
MongoDB doesn't support relational operations like joins. Instead, you can denormalise your data by embedding the rows you'd JOIN on inside the outer document. So instead of joining Products to Sales, you could have a products collection in which each product document embeds its own sales records.
Then whenever you retrieve a product, you also get its sales data, so there's no need to join or look up the info somewhere else.
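To make the embedding idea concrete, here is a minimal sketch of what such a product document might look like. The field names (name, price, sales, saleId, date, quantity) are assumptions chosen for illustration, not a schema from the original answer.

```javascript
// Hypothetical denormalised product document; all field names are illustrative.
const product = {
  _id: 123,
  name: "Widget",
  price: 9.99,
  // The rows that would live in a separate Sales table are embedded here.
  sales: [
    { saleId: 1, date: "2010-03-16", quantity: 10 },
    { saleId: 2, date: "2010-03-17", quantity: 4 },
  ],
};

// In the mongo shell, a single query would return the product and its sales:
//   db.products.findOne({ _id: 123 })

// No join is needed: the sales are already part of the document.
const totalSold = product.sales.reduce((sum, s) => sum + s.quantity, 0);
console.log(totalSold); // 14
```

The trade-off is that the sales array grows with every sale, so this layout suits cases where the embedded list stays reasonably small.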
Alternatively, you could split into two collections as you might with a relational database, then use an additional query to get a product's sales, something like this:
SQL:
SELECT * FROM Sales WHERE ProductId = 123
MongoDB:
db.sales.find( { productid: 123 } )
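The two-collection alternative amounts to a join done by the application: one lookup for the product, a second for its sales. The sketch below uses in-memory arrays as stand-ins for the two collections so it runs without a database; the helper productWithSales and the sample fields are hypothetical.

```javascript
// In-memory stand-ins for two collections; in MongoDB these would be
// db.products and db.sales. Field names and data are hypothetical.
const products = [
  { _id: 123, name: "Widget" },
  { _id: 456, name: "Gadget" },
];
const sales = [
  { _id: 1, productid: 123, quantity: 10 },
  { _id: 2, productid: 456, quantity: 3 },
  { _id: 3, productid: 123, quantity: 4 },
];

// Mirrors db.products.findOne({ _id: id }) followed by
// db.sales.find({ productid: id }): two lookups combined by the application.
function productWithSales(id) {
  const product = products.find((p) => p._id === id);
  if (!product) return null; // no such product, nothing to join
  const productSales = sales.filter((s) => s.productid === id);
  return { ...product, sales: productSales };
}

const joined = productWithSales(123);
console.log(joined.sales.length); // 2
```

Against a real database each lookup is a round trip, so an index on the productid field of the sales collection would usually be worth having.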
My approach is below.
Looking at Hadoop, I found the CompositeInputFormat approach.
Briefly, it takes two or more collections as input for a map-reduce job.
According to my investigation, MongoDB doesn't provide this yet.
MongoDB mapReduce is performed on one collection at a time (please correct me if I am wrong).
So I decided to put the collections that need to be joined into one collection, on which I perform the map-reduce for a SQL-style right join.
This is from my log reporter project.
The first map-reduce phase is enough to perform the right join in the "no clock" case.
The second map-reduce phase aims to exclude the superfluous right-join results caused by the clock field.
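The code from this answer did not survive, so the following is only a generic sketch of the idea it describes: put both record types into one collection and let map-reduce join them on a shared key. It is not the author's two-phase job and does not model the clock field. The mapReduce helper is a tiny in-memory stand-in, not the MongoDB API (in MongoDB the map function reads the document from this and calls a global emit), and the type, userId, name and message fields are assumptions.

```javascript
// One merged collection holding both record types, tagged by a hypothetical
// "type" field; "userId" is the hypothetical join key.
const merged = [
  { type: "user", userId: 1, name: "alice" },
  { type: "user", userId: 2, name: "bob" },
  { type: "log", userId: 1, message: "login" },
  { type: "log", userId: 1, message: "logout" },
];

// Tiny in-memory stand-in for a map-reduce run (not the MongoDB API).
// Like MongoDB, it skips reduce for keys that received a single value.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map(doc, emit));
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

// Map: every record emits the same value shape under the join key,
// filling in only the side it belongs to.
function map(doc, emit) {
  if (doc.type === "user") emit(doc.userId, { name: doc.name, messages: [] });
  else emit(doc.userId, { name: null, messages: [doc.message] });
}

// Reduce: merge both sides. The output has the same shape as the emitted
// values, so it stays correct if reduce is re-applied to partial results.
function reduce(key, values) {
  const out = { name: null, messages: [] };
  for (const v of values) {
    if (v.name !== null) out.name = v.name;
    out.messages.push(...v.messages);
  }
  return out;
}

const joinedRows = mapReduce(merged, map, reduce);
console.log(JSON.stringify(joinedRows, null, 2));
```

Keys present on only one side are kept with the other side empty, so a follow-up filter or second pass can drop whichever rows the desired join type excludes.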