JOIN operations with MongoDB MapReduce

Posted on 2024-11-07 15:37:59

I've used MapReduce before to perform classic MR operations, the equivalent of GROUP BY in SQL.

I was wondering if it would be conceptually possible to perform a JOIN operation with MapReduce. Any idea how that could be implemented? Does it make sense to use MapReduce for this kind of operation?

Thanks!

Comments (2)

骄兵必败 2024-11-14 15:37:59

MongoDB doesn't support relational operations like joins. Instead, you can denormalise your data by embedding the rows you'd JOIN on inside the outer document. So instead of joining Products to Sales, you could have a products collection with this schema:

products

{
    _id: 123,
    name: "Widget",
    price: 9.99,
    sales:
    [ 
        { id:1, date: "20100316", howMany: 2 },
        { id:2, date: "20100316", howMany: 5 }
    ]
}

Then whenever you retrieve a product, you also get its sales data, so there's no need to join or look up the info somewhere else.
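
As a rough illustration of why the embedded layout avoids a join, the sketch below works on a plain in-memory copy of the product document above. The helper name `totalUnits` is hypothetical and is not part of any MongoDB API; it only shows that the sales rows arrive together with the product.

```javascript
// In-memory copy of the embedded product document (no database needed).
const product = {
  _id: 123,
  name: "Widget",
  price: 9.99,
  sales: [
    { id: 1, date: "20100316", howMany: 2 },
    { id: 2, date: "20100316", howMany: 5 }
  ]
};

// Hypothetical helper: sum the units sold from the embedded sales array.
function totalUnits(doc) {
  return (doc.sales || []).reduce((sum, sale) => sum + sale.howMany, 0);
}

console.log(totalUnits(product)); // 7
```

The trade-off is that every product read carries all of its sales, which may not suit products with very many sales rows.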

Alternatively, you could split into two collections as you might with a relational database, then use an additional query to get a product's sales, something like this:

SQL: SELECT * FROM Sales WHERE ProductId = 123

MongoDB: db.sales.find( { productid: 123 } )

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
}

sales

{
    id: 1,
    productid: 123,
    date: "20100316",
    howMany: 2 
}

{
    id: 2,
    productid: 123,
    date: "20100316",
    howMany: 5
}
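
With two collections, the "join" happens in application code. The sketch below assumes both result sets have already been fetched into arrays (for example via `find()`); the function name `joinProductsWithSales` is made up for illustration. It indexes sales by `productid` and attaches them to each product.

```javascript
// Assumed to be the results of separate queries on the two collections.
const products = [{ _id: 123, name: "Widget", price: 9.99 }];
const sales = [
  { id: 1, productid: 123, date: "20100316", howMany: 2 },
  { id: 2, productid: 123, date: "20100316", howMany: 5 }
];

// Hypothetical helper: application-side join of products with their sales.
function joinProductsWithSales(productDocs, saleDocs) {
  // Group sales by product id so each product is matched in one lookup.
  const byProduct = new Map();
  for (const sale of saleDocs) {
    if (!byProduct.has(sale.productid)) byProduct.set(sale.productid, []);
    byProduct.get(sale.productid).push(sale);
  }
  // Products without sales get an empty array (left-join behaviour).
  return productDocs.map(p => ({ ...p, sales: byProduct.get(p._id) || [] }));
}

const joined = joinProductsWithSales(products, sales);
console.log(joined[0].sales.length); // 2
```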

≈。彩虹 2024-11-14 15:37:59

My approach is below.

Looking at Hadoop, I found the CompositeInputFormat approach. Briefly, it takes two or more collections as input for a map-reduce job.

According to my investigation, MongoDB doesn't provide this yet. MongoDB mapReduce is performed on one collection at a time (please correct me if I am wrong).

So I decided to put the collections that need to be joined into one collection, on which I perform the mapReduce for a "SQL right join".

This is from my log reporter project.
The first map-reduce phase is enough to perform the right join in the "no clock" case.
The second map-reduce phase aims to exclude the superfluous right-join rows caused by the clock field.

db.test.drop();
db.test.insert({"username" : 1, "day" : 1, "clock" : 0 });
db.test.insert({"username" : 1, "day" : 1, "clock" : 1 });
db.test.insert({"username" : 1,  startDay : 1,endDay:2, "table" : "user" });

// startDay: 1, endDay: 2 define the employee's working days (joined the company - left the company)
// you can use an array instead of a range here, for example day: [1, 2, 3, ...]

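// Map phase 1: "user" documents emit one marker row per working day;
// clock documents emit their clock value under the same {username, day} key.
m1 = function(){
   if( typeof this.table!= "undefined" && this.table!=null){
       var username = this.username;
       var startDay = this.startDay;
       var endDay   = this.endDay;
       while(startDay<=endDay){
           emit({username:username,day:startDay},{clocks:["join"]});
           startDay++;
       }
   }else{
       emit({username:this.username,day:this.day},{clocks:[this.clock]});
   }
}
// Reduce phase 1: merge the clock arrays for a key, then drop the "join" marker.
// A key with a single emitted value never reaches reduce, so a day without
// clock records keeps clocks:["join"].
r1 = function(key,values){
    var result = {clocks:[]};
    values.forEach(function(x){
        result.clocks = x.clocks.concat(result.clocks);
    });
    result.clocks = result.clocks.filter(function(element){
        return element!="join";
    });
    return result;
}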

db.test.mapReduce(m1,r1,{out:"result1"})
db.test.find();
db.result1.find();

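// Map phase 2: flatten each clocks array into one row per {username, day, clock}.
m2 = function(){
   var key = this._id;
   this.value.clocks.forEach(function(x){
       emit({username:key.username, day:key.day, clock:x}, 1);
   });
}
// Reduce phase 2: sum the emitted counts for each key.
r2 = function(key,values){
    var value = 0;
    values.forEach(function(x){
        value += x;
    });
    return value;
}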

db.result1.mapReduce(m2,r2,{out:"result2"})
db.test.find();
db.result2.find();
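
To follow the two phases without a running server, here is a minimal in-memory sketch of the same idea. The `mapReduce` function below is a hypothetical stand-in written for this example, not MongoDB's implementation; it only mimics two behaviours the approach relies on: grouping emitted values by key, and skipping reduce for keys with a single value.

```javascript
// Hypothetical in-memory stand-in for a map-reduce run over an array of docs.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  for (const doc of docs) {
    map(doc, (key, value) => {
      const k = JSON.stringify(key); // keys are built with a fixed field order
      if (!groups.has(k)) groups.set(k, { key, values: [] });
      groups.get(k).values.push(value);
    });
  }
  // Reduce is skipped when a key has exactly one emitted value.
  return [...groups.values()].map(({ key, values }) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values)
  }));
}

const test = [
  { username: 1, day: 1, clock: 0 },
  { username: 1, day: 1, clock: 1 },
  { username: 1, startDay: 1, endDay: 2, table: "user" }
];

// Phase 1: user rows emit a "join" marker per working day, clock rows emit clocks.
const m1 = (doc, emit) => {
  if (doc.table != null) {
    for (let day = doc.startDay; day <= doc.endDay; day++) {
      emit({ username: doc.username, day }, { clocks: ["join"] });
    }
  } else {
    emit({ username: doc.username, day: doc.day }, { clocks: [doc.clock] });
  }
};
const r1 = (key, values) => ({
  clocks: values.flatMap(v => v.clocks).filter(c => c !== "join")
});
const result1 = mapReduce(test, m1, r1);

// Phase 2: one row per {username, day, clock}, counting occurrences.
const m2 = (doc, emit) => {
  for (const clock of doc.value.clocks) emit({ ...doc._id, clock }, 1);
};
const r2 = (key, values) => values.reduce((a, b) => a + b, 0);
const result2 = mapReduce(result1, m2, r2);

console.log(JSON.stringify(result2));
```

In this sketch, day 1 yields one row per recorded clock, while day 2 (a working day with no clock records) survives as a single row whose clock is the "join" marker, which is the unmatched side of the right join.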