I have a MongoDB collection, whose docs use several levels of nesting, from which I would like to extract a multidimensional array compiled from a subset of their fields. I have a solution that works for me right now, but I want to better understand this concept of 'idempotency' and its consequences related to the reduce function.
{
    "host_name" : "gateway",
    "service_description" : "PING",
    "last_update" : 1305777787,
    "performance_object" : [
        [ "rta", 0.105, "ms", 100, 500, 0 ],
        [ "pl", 0, "%", 20, 60, 0 ]
    ]
}
And here are the map/reduce functions
var M = function() {
    var hn = this.host_name,
        sv = this.service_description,
        ts = this.last_update;
    this.performance_object.forEach(function(P){
        emit( {
            host: hn,
            service: sv,
            metric: P[0]
        }, {
            time: ts,
            value: P[1]
        } );
    });
}
var R = function(key,values) {
    var result = {
        time: [],
        value: []
    };
    values.forEach(function(V){
        result.time.push(V.time);
        result.value.push(V.value);
    });
    return result;
}
db.runCommand({
    mapreduce: <colname>,
    out: <col2name>,
    map: M,
    reduce: R
});
Data is returned in a useful structure, which I reformat/sort with finalize for graphing.
{
    "_id" : {
        "host" : "localhost",
        "service" : "Disk Space",
        "metric" : "/var/bck"
    },
    "value" : {
        "time" : [
            [ 1306719302, 1306719601, 1306719903, ... ],
            [ 1306736404, 1306736703, 1306737002, ... ],
            [ 1306766401, 1306766701, 1306767001, ... ]
        ],
        "value" : [
            [ 122, 23423, 25654, ... ],
            [ 336114, 342511, 349067, ... ],
            [ 551196, 551196, 551196, ... ]
        ]
    }
}
Finally...
[ [1306719302,122], [1306719601,23423], [1306719903,25654], ... ]
TL;DR: What is the expected behavior with the observed "chunking" of the array results?
I understand that the reduce function may be called multiple times on array(s) of emitted values, which is why there are several "chunks" of the complete arrays, rather than a single array. The array chunks are typically 25-50 items and it's easy enough to clean this up in finalize(). I concat() the arrays, interleave them as [time,value] and sort. But what I really want to know is if this can get more complex:
1) Is the chunking observed because of my code, MongoDB's implementation or the Map/Reduce algorithm itself?
2) Will there ever be deeper (recursive) nesting of array chunks in sharded configurations or even just because of my hasty implementation? This would break the concat() method.
3) Is there simply a better strategy for getting array results as shown above?
EDIT: Modified to emit arrays:
I took Thomas' advice and rewrote it to emit arrays. It doesn't make any sense to split up the values.
var M = function() {
    var hn = this.host_name,
        sv = this.service_description,
        ts = this.last_update;
    this.performance_object.forEach(function(P){
        emit( {
            host: hn,
            service: sv,
            metric: P[0]
        }, {
            value: [ ts, P[1] ]
        } );
    });
}
var R = function(key,values) {
    var result = {
        value: []
    };
    values.forEach(function(V){
        result.value.push(V.value);
    });
    return result;
}
db.runCommand({
    mapreduce: <colname>,
    out: <col2name>,
    map: M,
    reduce: R
});
Now the output is similar to this:
{
    "_id" : {
        "host" : "localhost",
        "service" : "Disk Space",
        "metric" : "/var/bck"
    },
    "value" : {
        "value" : [
            [ [1306736404,336114],[1306736703,342511],[1306737002,349067], ... ],
            [ [1306766401,551196],[1306766701,551196],[1306767001,551196], ... ],
            [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
        ]
    }
}
And I used this finalize function to concatenate the array chunks and sort them.
...
var F = function(key,values) {
    return (Array.concat.apply([],values.value)).sort(function(a,b){
        if (a[0] < b[0]) return -1;
        if (a[0] > b[0]) return 1;
        return 0;
    });
}
db.runCommand({
    mapreduce: <colname>,
    out: <col2name>,
    map: M,
    reduce: R,
    finalize: F
});
Which works nicely:
{
    "_id" : {
        "host" : "localhost",
        "service" : "Disk Space",
        "metric" : "/mnt/bck"
    },
    "value" : [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
}
I guess the only question that's gnawing at me is whether this Array.concat.apply([],values.value) can be trusted to clean up the output of reduce all of the time.
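One way to probe that question is to replay the reduce and finalize steps in plain JavaScript, outside MongoDB. The sketch below is only a simulation under assumptions: the batch boundaries and the sample pairs are made up, and flattenOnce is a hypothetical helper that uses the standard Array.prototype.concat.apply in place of the engine-specific Array.concat.apply. It suggests the one-level flatten gives the intended result only when a key has been reduced and then re-reduced exactly once.

```javascript
// Reduce function from the "emit arrays" version above.
var R = function (key, values) {
    var result = { value: [] };
    values.forEach(function (V) {
        result.value.push(V.value);
    });
    return result;
};

// Hypothetical helper: flattens exactly one level of nesting.
function flattenOnce(value) {
    return Array.prototype.concat.apply([], value);
}

// Made-up [time, value] pairs, shaped like the map function's output.
var e = [
    { value: [1, 10] },
    { value: [2, 20] },
    { value: [3, 30] },
    { value: [4, 40] }
];

// Case 1: reduce in two batches, then re-reduce (the shape seen in the question).
var twoLevel = R("k", [R("k", e.slice(0, 2)), R("k", e.slice(2))]);
console.log(JSON.stringify(flattenOnce(twoLevel.value)));
// [[1,10],[2,20],[3,30],[4,40]]

// Case 2: a single reduce pass, e.g. a key with only a few values.
var oneLevel = R("k", e);
console.log(JSON.stringify(flattenOnce(oneLevel.value)));
// [1,10,2,20,3,30,4,40]  (the pairs are flattened away)
```

Whether MongoDB actually produces the single-pass shape for a given key depends on how it batches values, which this sketch does not model.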
LAST EDIT: Much simpler...
I have modified the document structure since the original example given above, but this only changes the example by making the map function really simple.
I'm still trying to wrap my brain around why Array.prototype.push.apply(result, V.data) works so differently from result.push(V.data)... but it works.
var M = function() {
    emit( {
        host: this.host,
        service: this.service,
        metric: this.metric
    } , {
        data: [ [ this.timestamp, this.data ] ]
    } );
}
var R = function(key,values) {
    var result = [];
    values.forEach(function(V){
        Array.prototype.push.apply(result, V.data);
    });
    return { data: result };
}
var F = function(key,values) {
    return values.data.sort(function(a,b){
        return (a[0]<b[0]) ? -1 : (a[0]>b[0]) ? 1 : 0;
    });
}
It has the same output as shown just above the LAST EDIT heading.
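On the difference between Array.prototype.push.apply(result, V.data) and result.push(V.data) mentioned above, a minimal sketch in plain JavaScript may help. The sample chunk is invented; the point is only that push appends its argument as one element, while apply spreads the array into separate arguments.

```javascript
// A chunk of [time, value] pairs, as one reduce call might return it.
var chunk = [[1306719302, 122], [1306719601, 122]];

// push(chunk) appends the chunk itself as ONE element, adding a nesting level.
var nested = [];
nested.push(chunk);
console.log(nested.length); // 1

// push.apply(flat, chunk) is equivalent to flat.push(chunk[0], chunk[1]),
// so each pair becomes its own element and no extra nesting appears.
var flat = [];
Array.prototype.push.apply(flat, chunk);
console.log(flat.length); // 2

console.log(JSON.stringify(flat));
// [[1306719302,122],[1306719601,122]]
```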
Thanks, Thomas!
Comments (1)
The "chunking" comes from your code: your reduce function's values parameter can contain either {time: <timestamp>, value: <value>} emitted from your map function, or {time: [<timestamps>], value: [<values>]} returned from a previous call to your reduce function.

I don't know if it will happen in practice, but it can happen in theory.
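The two input shapes can be reproduced outside MongoDB with a small simulation. This is only a sketch under assumptions: the reduce function is the first version from the question, the sample values are taken from the question's output, and the batch boundaries are hypothetical because MongoDB decides the real ones.

```javascript
// Reduce function from the question (first version).
var R = function (key, values) {
    var result = { time: [], value: [] };
    values.forEach(function (V) {
        result.time.push(V.time);
        result.value.push(V.value);
    });
    return result;
};

// Values in the shape the map function emits.
var emitted = [
    { time: 1306719302, value: 122 },
    { time: 1306719601, value: 23423 },
    { time: 1306736404, value: 336114 },
    { time: 1306736703, value: 342511 }
];

// Hypothetical batching: two partial reduces over raw emitted values.
var partialA = R("k", emitted.slice(0, 2));
var partialB = R("k", emitted.slice(2));

// Re-reduce: the inputs are earlier reduce outputs, so whole arrays get pushed.
var rereduced = R("k", [partialA, partialB]);
console.log(JSON.stringify(rereduced.time));
// [[1306719302,1306719601],[1306736404,1306736703]]

// Mixed input: one earlier reduce output next to one raw emitted value.
var mixed = R("k", [partialA, emitted[2]]);
console.log(JSON.stringify(mixed.time));
// [[1306719302,1306719601],1306736404]
```

The mixed case is the one that a single concat() in finalize would not anticipate, since chunks and bare values end up side by side.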
Simply have your map function emit the same kind of object that your reduce function returns, i.e. emit(<id>, {time: [ts], value: [P[1]]}), and change your reduce function accordingly, i.e. Array.prototype.push.apply(result.time, V.time), and similarly for result.value.

Well, I actually don't understand why you're not using an array of time/value pairs instead of a pair of arrays, i.e. emit(<id>, { pairs: [ {time: ts, value: P[1]} ] }) or emit(<id>, { pairs: [ [ts, P[1]] ] }) in the map function, and Array.prototype.push.apply(result.pairs, V.pairs) in the reduce function. That way, you won't even need the finalize function (except maybe to "unwrap" the array from the pairs property: because the reduce function cannot return an array, you have to wrap it in an object that way).
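As a rough check of the pairs approach, the sketch below simulates it in plain JavaScript. The helper emittedValue is hypothetical (it stands in for the second argument of emit()), and the batching is invented. The idea it illustrates is that when emitted values and reduce results share one shape, the result should not depend on how the values are batched or re-reduced.

```javascript
// Hypothetical helper: the value the map function would pass to emit().
function emittedValue(ts, v) {
    return { pairs: [[ts, v]] };
}

// Reduce returns the same shape it receives: an object wrapping a flat array of pairs.
var R = function (key, values) {
    var result = { pairs: [] };
    values.forEach(function (V) {
        Array.prototype.push.apply(result.pairs, V.pairs);
    });
    return result;
};

var emitted = [
    emittedValue(1306719302, 122),
    emittedValue(1306719601, 23423),
    emittedValue(1306736404, 336114),
    emittedValue(1306736703, 342511)
];

// One pass over everything.
var oneShot = R("k", emitted);

// Invented batching: a partial result, a raw emitted value, and another partial result.
var batched = R("k", [
    R("k", emitted.slice(0, 2)),
    emitted[2],
    R("k", emitted.slice(3))
]);

console.log(JSON.stringify(oneShot) === JSON.stringify(batched)); // true
console.log(JSON.stringify(batched.pairs));
// [[1306719302,122],[1306719601,23423],[1306736404,336114],[1306736703,342511]]
```

The same shape also covers keys with a single emitted value, for which MongoDB does not call reduce at all: the value reaching finalize is still an object with a pairs array. Sorting by timestamp would still have to happen somewhere if the order matters.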