MongoDB不将索引用于时间序列数据

发布于 2025-02-10 05:35:35 字数 2020 浏览 1 评论 0原文

我是MongoDB和NOSQL的新手,我正在尝试在一个庞大的数据集(约5000万个文档)上运行查询,

我正在使用64 GB RAM的Windows 10主机上运行最新版本的MongoDB。

我正在使用pymongo导入数据并运行查询,并且我还具有mongo express作为Docker容器运行以查看导入的数据。

我的创建时间序列集合的语句是:

mydb.command('create', 'sensor_data', timeseries={
    'timeField': 'collection_time', 
    'metaField': 'sensor' 
})

每个文档看起来都这样:

{
    "sensor": { "id": 1, "location":"Somewhere"},
    "collection_time": datetime.strptime("2022/01/01 01:23:45 PM", '%Y/%m/%d %I:%M:%S %p'),
    ...
}

我能够使用mongo express来验证数据已正确加载到mongodb中。

然后,我尝试运行以下代码:

res = mycol.find({
    "collection_time": { 
        "$gte": datetime.strptime("2021/01/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p'), 
        "$lte": datetime.strptime("2022/02/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p') 
    }
})

但是此查询需要很长时间才能运行。

运行res.clain()后,我可以看到该操作正在执行colscan,而不是使用索引。我什至尝试在“ Collection_time”上手动创建索引,但查询仍在进行colscan。 我想念什么?

更新1

感谢R2D2的解决方案,我对find()> 进行了工作,但是我无法为gengregate()使用它。 这是我的代码:

res = mycol.aggregate([
    { 
        "$match": {
            "collection_time": { 
                "$gte": datetime.strptime("2021/01/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p'), 
                "$lte": datetime.strptime("2022/02/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p') 
            }
        }
    },
    {"$group":
        { 
            "_id": {
                 "year" : { "$year" : "$collection_time" },        
                "month" : { "$month" : "$collection_time" },        
                "day" : { "$dayOfMonth" : "$collection_time" },
            }, 
            "count":{ "$sum": 1}
        }
    }
], {hint: "collection_time_1" })

这给出了错误:nameError:名称'提示'未定义

putting in QUATES中的提示给出了错误:attributeError:'dict'objoct:'dict'对象没有属性'_txn_read_preference'

I am new to MongoDB and NoSQL and I am trying to run queries on a huge data set (around 50 million documents)

I am running the latest version of MongoDB using Docker on a Windows 10 host with 64 GB RAM.

I am using pymongo to import the data and run queries and I also have Mongo Express running as a docker container to view the imported data.

My statement to create my time-series collection is:

mydb.command('create', 'sensor_data', timeseries={
    'timeField': 'collection_time', 
    'metaField': 'sensor' 
})

Each document looks something like this:

{
    "sensor": { "id": 1, "location":"Somewhere"},
    "collection_time": datetime.strptime("2022/01/01 01:23:45 PM", '%Y/%m/%d %I:%M:%S %p'),
    ...
}

I have been able to use Mongo Express to verify that the data has been loaded into MongoDB correctly.

I then tried to run the following bit of code:

res = mycol.find({
    "collection_time": { 
        "$gte": datetime.strptime("2021/01/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p'), 
        "$lte": datetime.strptime("2022/02/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p') 
    }
})

But this query takes a very long time to run.

After running res.explain() , I can see that the operation is doing a COLSCAN and not using an index. I even tried manually creating an index on 'collection_time' but the query is still doing a COLSCAN.
What am I missing?

Update 1

Thanks to R2D2's solution, I have got this working for find(), but I can't get it to work for aggregate()
Here is my code:

res = mycol.aggregate([
    { 
        "$match": {
            "collection_time": { 
                "$gte": datetime.strptime("2021/01/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p'), 
                "$lte": datetime.strptime("2022/02/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p') 
            }
        }
    },
    {"$group":
        { 
            "_id": {
                 "year" : { "$year" : "$collection_time" },        
                "month" : { "$month" : "$collection_time" },        
                "day" : { "$dayOfMonth" : "$collection_time" },
            }, 
            "count":{ "$sum": 1}
        }
    }
], {hint: "collection_time_1" })

This gives the error: NameError: name 'hint' is not defined

Putting hint in quotes gives the error: AttributeError: 'dict' object has no attribute '_txn_read_preference'

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

手心的温暖 2025-02-17 05:35:35

默认情况下,当您创建时间序列集合时,对于存储时间序列数据是有效的,但是没有创建索引,您可以创建辅助索引用于时间序列集来提高查询的性能,如果查询计划者不选择某些创建索引,则可以将shint()添加到查询中使用索引名称(您可以使用db.Collection.getIndexes())获得索引名称

 db.collection.aggregate(pipeline, {hint: "index_name"})

By default when you create Time Series collection it is effective for storing time series data , but there is no indexes created , you can create secondary indexes for Time Series collection to improve performance for queries and if query planner do not select some of the created indexes you can add hint() to the query with the index name ( you can get the indexes names with db.collection.getIndexes() )

For mongodb 3.6+ you can use hint also in aggregation framework as follow:

 db.collection.aggregate(pipeline, {hint: "index_name"})
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文