"AVG" and "SUM" functionality in MongoDB, any suggestions?
I'm a relative newbie to MongoDB, but from what I've read there are various ways of going about finding averages and sums of values in a MongoDB database, each with different benefits and drawbacks.
I'm primarily asking for a way of finding the sum of a selection of values, and the average of a selection of values, as efficiently (quickly) as possible.
The documents in the collection being queried resemble this structure (with a lot of other fields):
{
"_id": ObjectId('4e650107580fd649e5000005'),
"date_added": ISODate("2011-09-05T00:00:00Z"),
"value": 1500
}
Precalculating things like sums is not always possible in my application, because the selection of values to be summed can change (it's based on date ranges, e.g. "what is the average between a start date and an end date?"). The same problem applies to precalculating averages.
From what I've read, MapReduce is definitely not ideal for real-time (i.e. on demand) lookup, so that seems to be out of the question too.
At the moment I'm querying the collection in this way (note: this is using pymongo):
response = request.db['somecollection'].find(
    {
        'date_added': {
            '$gte': date_start,
            '$lte': date_end
        }
    },
    {
        'value': 1
    }
).limit(500)
Then I do the calculation in Python using a for loop over the response. The limit of 500 results is arbitrary, to keep the query from becoming too slow. I'm only retrieving the value, and none of the other fields.
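Concretely, the client-side step looks like this (a minimal sketch; response is the cursor returned by the query above):

# Accumulate sum and count client-side, then derive the average.
total, count = 0, 0
for doc in response:  # each doc looks like {'_id': ..., 'value': 1500}
    total += doc['value']
    count += 1
average = float(total) / count if count else None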
Is this the most efficient way of doing this calculation, or are there other ways to accomplish what I need?
Caveats:
- I can't use the group function because I will probably be using sharding in the future.
- I can't use MapReduce because it's a function which will be used on-the-fly by users.
- I can't precalculate a lot of my sums/averages because the selection of values to sum/average is almost always different.
- I have looked around Stack Overflow and the web to try to find recommendations on how to do this kind of thing, and it's fairly open-ended.
EDIT:
I should point out that the number of documents returned by the query I posted above could be anything from 1 document to hundreds, but the maximum number of returned documents will probably be about 150 (with an average of about 60 or 70).
Comments (4)
Give map-reduce a try, it's probably not as slow as you think. I've used it for real-time aggregation over some large data sets, and although it's sometimes not lightning fast, it's more often fine. It's best if you can filter down the size of the initial data you're aggregating, e.g.:
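A sketch of what that filtered map-reduce might look like with pymongo. The map/reduce functions and the 'mr_results' output collection name are illustrative, not from the original answer; the query argument is what does the filtering:

from bson.code import Code

mapper = Code("function () { emit(null, { sum: this.value, count: 1 }); }")
reducer = Code("function (key, values) {"
               "  var r = { sum: 0, count: 0 };"
               "  values.forEach(function (v) { r.sum += v.sum; r.count += v.count; });"
               "  return r;"
               "}")

# The query argument filters the input before the map phase runs,
# which is the "filter down the initial data" advice from above.
out = request.db['somecollection'].map_reduce(
    mapper, reducer, 'mr_results',
    query={'date_added': {'$gte': date_start, '$lte': date_end}}
)
doc = out.find_one()
if doc:
    stats = doc['value']
    average = stats['sum'] / float(stats['count'])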
If you need to speed things up even more, consider distributing the data over a sharded cluster. Then the map-reduce processing can be scaled out across multiple shards running in parallel.
MongoDB notes
OK, so Map/Reduce and aggregation have some serious issues currently.
Big caveat: a MongoDB instance can only have one JavaScript engine instance. This means you cannot run two simultaneous Map/Reduces on the server, and you only get one core for running a map-reduce.
In the case of what you are doing, you're basically "rolling your own" M/R. The downside is the extra network traffic. The upside is that you can now throw more cores at the problem (from the web-servers).
Your key question
There is no general method for optimizing "all possible" queries. If you want the system to be able to sum and aggregate along every field for every range, you will eventually find a set of fields/ranges that is too big.
The way to "solve" this is to reduce the set of fields and ranges.
So keep daily / hourly counters and sum over those counters. At the least you reduce the number of documents you need to scan in order to answer your query.
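A sketch of that counter pattern, assuming a hypothetical daily_counters collection (the collection and field names are illustrative):

from datetime import datetime

# On every insert, bump a per-day counter document with an upsert;
# $inc creates the fields on first touch.
day = datetime(2011, 9, 5)  # the document's date_added truncated to a day
request.db['daily_counters'].update(
    {'day': day},
    {'$inc': {'sum': 1500, 'count': 1}},
    upsert=True
)

# Answering a range query then only scans one counter document per day,
# instead of every raw document in the range.
total, count = 0, 0
for doc in request.db['daily_counters'].find(
        {'day': {'$gte': date_start, '$lte': date_end}}):
    total += doc['sum']
    count += doc['count']
average = float(total) / count if count else None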
The simple answer is:
But at the same time, the MongoDB guide says:
So it sounds like MongoDB is not the best solution for real-time data aggregation.
MongoDB is slated to get native aggregation functions for things like sum/avg/min/max in version 2.1.1 (currently planned for Nov 1, 2011). For more detail and status, see the issue at: https://jira.mongodb.org/browse/SERVER-447
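For reference, once those operators ship, the sum and average over the same date range could be expressed as a single server-side pipeline. This is a sketch based on the proposed $match/$group stages, not an API that existed at the time of the answer:

result = request.db['somecollection'].aggregate([
    # Restrict to the date range first, so only matching documents are grouped.
    {'$match': {'date_added': {'$gte': date_start, '$lte': date_end}}},
    # A single group over all matches yields the sum and average in one pass.
    {'$group': {
        '_id': None,
        'total': {'$sum': '$value'},
        'average': {'$avg': '$value'},
    }},
])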