"AVG" and "SUM" functionality in MongoDB, any suggestions?
I'm a relative newbie to MongoDB, but from what I've read there are various ways of going about finding averages and sums of values in a MongoDB database, each with different benefits and drawbacks.
I'm primarily asking for a way of finding the sum of a selection of values, and the average of a selection of values, as efficiently (quickly) as possible.
The documents in the collection being queried resemble this structure (with a lot of other fields):
{
"_id": ObjectId('4e650107580fd649e5000005'),
"date_added": ISODate("2011-09-05T00:00:00Z"),
"value": 1500
}
Precalculating things like sums is not always possible in my application, because the selection of values to be summed can change (it's based on date ranges, e.g. "what is the average between a start date and an end date?"). The same problem applies to precalculating averages.
From what I've read, MapReduce is definitely not ideal for real-time (i.e. on demand) lookup, so that seems to be out of the question too.
At the moment I'm querying the collection in this way (note: this is using pymongo):
response = request.db['somecollection'].find(
    {
        'date_added': {
            '$gte': date_start,
            '$lte': date_end
        }
    },
    {
        'value': 1
    }
).limit(500)
Then I do the calculation in Python using a for loop over the response. The limit of 500 results is arbitrary, to keep the query from becoming too slow. I'm only retrieving the value, and none of the other fields.
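Concretely, the client-side step looks like this (a minimal sketch; response is the cursor returned by the query above):

# Accumulate sum and count client-side, then derive the average.
total, count = 0, 0
for doc in response:  # each doc looks like {'_id': ..., 'value': 1500}
    total += doc['value']
    count += 1
average = float(total) / count if count else None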
Is this the most efficient way of doing this calculation, or are there other ways to accomplish what I need?
Caveats:
- I can't use the group function because I will probably be using sharding in the future.
- I can't use MapReduce because it's a function which will be used on-the-fly by users.
- I can't precalculate a lot of my sums/averages because the selection of values to sum/average is almost always different.
- I have looked around Stack Overflow and the web to try to find recommendations on how to do this kind of thing, and it's fairly open-ended.
EDIT:
I should point out that the number of documents returned by the query I posted above could be anything from 1 document to hundreds, but the maximum number of returned documents will probably be about 150 (with an average of about 60 or 70).
Comments (4)
Give map-reduce a try, it's probably not as slow as you think. I've used it for real-time aggregation over some large data sets, and although it's sometimes not lightning fast, it's more often fine. It's best if you can filter down the size of the initial data you're aggregating, e.g.:
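A sketch of what that filtered map-reduce might look like with pymongo. The map/reduce functions and the 'mr_results' output collection name are illustrative, not from the original answer; the query argument is what does the filtering:

from bson.code import Code

mapper = Code("function () { emit(null, { sum: this.value, count: 1 }); }")
reducer = Code("function (key, values) {"
               "  var r = { sum: 0, count: 0 };"
               "  values.forEach(function (v) { r.sum += v.sum; r.count += v.count; });"
               "  return r;"
               "}")

# The query argument filters the input before the map phase runs,
# which is the "filter down the initial data" advice from above.
out = request.db['somecollection'].map_reduce(
    mapper, reducer, 'mr_results',
    query={'date_added': {'$gte': date_start, '$lte': date_end}}
)
doc = out.find_one()
if doc:
    stats = doc['value']
    average = stats['sum'] / float(stats['count'])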
If you need to speed things up even more, consider distributing the data over a sharded cluster. Then the map-reduce processing can be scaled out across multiple shards running in parallel.
MongoDB notes
OK, so Map/Reduce and aggregation have some serious issues currently.
Big caveat: a MongoDB instance can only have one JavaScript engine instance. This means you cannot run two simultaneous Map/Reduces on the server, and you only get one core for running a map-reduce.
In the case of what you are doing, you're basically "rolling your own" M/R. The downside is the extra network traffic. The upside is that you can now throw more cores at the problem (from the web-servers).
Your key question
There is no general method for optimizing "all possible" queries. If you want the system to be able to sum and aggregate along every field for every range, you will eventually find a set of fields/ranges that is too big.
The way to "solve" this is to reduce the set of fields and ranges.
So keep daily / hourly counters and sum over those counters. At the least you reduce the number of documents you need to scan in order to answer your query.
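A sketch of that counter pattern, assuming a hypothetical daily_counters collection (the collection and field names are illustrative):

from datetime import datetime

# On every insert, bump a per-day counter document with an upsert;
# $inc creates the fields on first touch.
day = datetime(2011, 9, 5)  # the document's date_added truncated to a day
request.db['daily_counters'].update(
    {'day': day},
    {'$inc': {'sum': 1500, 'count': 1}},
    upsert=True
)

# Answering a range query then only scans one counter document per day,
# instead of every raw document in the range.
total, count = 0, 0
for doc in request.db['daily_counters'].find(
        {'day': {'$gte': date_start, '$lte': date_end}}):
    total += doc['sum']
    count += doc['count']
average = float(total) / count if count else None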
The simple answer is:
But at the same time, the MongoDB guide says:
So it sounds like MongoDB is not the best solution for real-time data aggregation.
MongoDB is slated to get native aggregation functions for things like sum/avg/min/max in version 2.1.1 (currently planned for Nov 1, 2011). For more detail and status, see the issue at: https://jira.mongodb.org/browse/SERVER-447
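For reference, once those operators ship, the sum and average over the same date range could be expressed as a single server-side pipeline. This is a sketch based on the proposed $match/$group stages, not an API that existed at the time of the answer:

result = request.db['somecollection'].aggregate([
    # Restrict to the date range first, so only matching documents are grouped.
    {'$match': {'date_added': {'$gte': date_start, '$lte': date_end}}},
    # A single group over all matches yields the sum and average in one pass.
    {'$group': {
        '_id': None,
        'total': {'$sum': '$value'},
        'average': {'$avg': '$value'},
    }},
])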