"Avg" and "sum" functionality in MongoDB, any suggestions?

Posted 2024-12-03 06:54:27


I'm a relative newbie to MongoDB, but from what I've read there are various ways to go about finding averages and sums of values in a MongoDB database, each with its own benefits and drawbacks.

I'm primarily asking for a method of finding the sum of a selection of values, and the average of a selection of values, in as efficient (fast) a way as possible.

The documents in the collection being queried resemble this structure (with a lot of other fields):

{
    "_id": ObjectId('4e650107580fd649e5000005'),
    "date_added": ISODate("2011-09-05T00:00:00Z"),
    "value": 1500
}

Precalculating things like sums is, in my application, not always possible, because the selection of values to be summed can change (based on date ranges - e.g. what is the average between a start date and an end date?). The same problem applies to precalculating averages.

From what I've read, MapReduce is definitely not ideal for real-time (i.e. on demand) lookup, so that seems to be out of the question too.

At the moment I'm querying the collection in this way: (note: this is using pymongo)

response = request.db['somecollection'].find(
    {
        'date_added': {
            '$gte': date_start,
            '$lte': date_end
        }
    },
    {
        'value':1
    }
).limit(500)

Then doing the calculation in Python using a for loop over the response. The limit of 500 results is arbitrary, to keep it from becoming too slow. I'm only retrieving the value, and none of the other fields.
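For reference, the Python side is essentially this (a simplified sketch of my loop, continuing from the query above, not my exact code):

total = 0
count = 0
for doc in response:
    total += doc['value']   # only the 'value' field is projected
    count += 1

average = float(total) / count if count else None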

Is this the most efficient method of doing this calculation, or are there other methods to accomplish what I need?

Caveats:

  • I can't use the group function because I will probably be using sharding in the future
  • I can't use MapReduce because it's a function which will be used on-the-fly by users
  • I can't precalculate a lot of my sums/averages because the selection of values to sum/average is almost always different
  • I have looked around stackoverflow and the web to try and find recommendations on how to do this kind of thing, and it's fairly open-ended

EDIT:

I should point out that the number of documents returned from the query I posted above could be anything from 1 document to hundreds, but will probably have a maximum of about 150 returned documents (with an average of about 60 or 70).


Comments (4)

一刻暧昧 2024-12-10 06:54:27


Give map-reduce a try, it's probably not as slow as you think. I've used it for real-time aggregation over some large data sets, and although it's sometimes not lightning fast, it's more often fine. It's best if you can filter down the size of the initial data you're aggregating, e.g.:

db.collection.mapReduce(m, r, { query : { year: 2011 } });

If you need to speed things up even more, consider distributing the data over a sharded cluster. Then the map-reduce processing can be scaled out across multiple shards running in parallel.
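In pymongo the call above might look roughly like this (a sketch assuming the older driver API where Collection.map_reduce is available; "value_stats" is just a placeholder output collection name):

from bson.code import Code

# db is the pymongo Database handle (request.db in the question)
mapper = Code("""
    function () {
        emit(null, { sum: this.value, count: 1 });
    }
""")

reducer = Code("""
    function (key, values) {
        var out = { sum: 0, count: 0 };
        values.forEach(function (v) {
            out.sum += v.sum;
            out.count += v.count;
        });
        return out;
    }
""")

stats = db['somecollection'].map_reduce(
    mapper,
    reducer,
    "value_stats",
    query={'date_added': {'$gte': date_start, '$lte': date_end}}
).find_one()

if stats:
    total = stats['value']['sum']
    average = float(total) / stats['value']['count']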

笨笨の傻瓜 2024-12-10 06:54:27


MongoDB notes

OK, so Map/Reduce and aggregation have some serious issues currently.

Big caveat: the MongoDB instance can only have one "javascript engine" instance. This means that you cannot run two simultaneous Map/Reduces on the server. And you only get one core for running the map-reduce.

In the case of what you are doing, you're basically "rolling your own" M/R. The downside is the extra network traffic. The upside is that you can now throw more cores at the problem (from the web-servers).

Your key question

I can't precalculate a lot of my sums/averages because the selection of values to sum/average is almost always different

There is no general method for optimizing "all possible" queries. If you want the system to be able to sum and aggregate along every field for every range, then you will eventually find a set of fields/ranges that are too big.

The way to "solve" this is to reduce the set of fields and ranges.

So keep daily / hourly counters and sum over those counters. At the least you reduce the number of documents you need to scan in order to answer your query.
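A rough sketch of that counter idea (collection and field names here are made up for illustration, and this uses the old pymongo update/upsert syntax):

from datetime import datetime

def record_value(db, when, value):
    # bump one pre-aggregated document per calendar day
    day = datetime(when.year, when.month, when.day)
    db['daily_totals'].update(
        {'_id': day},
        {'$inc': {'sum': value, 'count': 1}},
        upsert=True
    )

def range_average(db, date_start, date_end):
    # a range query now scans at most one document per day
    total = 0
    count = 0
    for day in db['daily_totals'].find({'_id': {'$gte': date_start, '$lte': date_end}}):
        total += day['sum']
        count += day['count']
    return float(total) / count if count else None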

神爱温柔 2024-12-10 06:54:27


Simple answer is:

  1. If possible, precalculate everything you can precalculate.
  2. If you need to aggregate data by date ranges and the aggregation should run as quickly as possible, then use map/reduce + sharding to distribute the calculation across multiple machines.

But at the same time, the MongoDB guide says:

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in "real time." You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

So it sounds like MongoDB is not the best solution for real-time data aggregation.

红墙和绿瓦 2024-12-10 06:54:27


MongoDB is slated to get native aggregation functions for things like sum/avg/min/max in version 2.1.1 (currently slated for Nov 1, 2011). For more detail and status see the issue at: https://jira.mongodb.org/browse/SERVER-447
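A rough sketch of what that could look like from pymongo once the framework ships (operator names are taken from the proposal, so the exact syntax may differ):

result = db['somecollection'].aggregate([
    {'$match': {'date_added': {'$gte': date_start, '$lte': date_end}}},
    {'$group': {
        '_id': None,
        'total': {'$sum': '$value'},
        'average': {'$avg': '$value'}
    }}
])
# depending on the driver version the result is either a command-response
# dict or a cursor over the grouped documents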
