Retrieving a large number of records with MongoDB in a reasonable time

Posted on 2024-12-15 15:32:18 · 1,306 characters · 1 view · 0 comments


I'm using MongoDB to store a query log and get some stats about it. The objects that I store in MongoDB contain the text of the query, the date, the user, whether the user clicked on some results, etc.

Now I'm trying to retrieve, with Java, all the queries not clicked by a user on a certain day. My code is approximately this:

    DBObject query = new BasicDBObject();
    BasicDBObject keys = new BasicDBObject();
    keys.put("Query", 1);  // projection: return only the Query field
    query.put("Date", new BasicDBObject("$gte", beginning.getTime()).append("$lte", end.getTime()));
    query.put("IsClick", false);
    ...
    DBCursor cur = mongoCollection.find(query, keys).batchSize(5000);

The output of the query contains about 20k records that I need to iterate over. The problem is that it takes minutes :( . I don't think that's normal.
From the server log I see:

Wed Nov 16 16:28:40 query db.QueryLogRecordImpl ntoreturn:5000 reslen:252403 nscanned:59260 { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false }  nreturned:5000 2055ms
Wed Nov 16 16:28:40 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false }  bytes:232421 nreturned:5000 170ms
Wed Nov 16 16:30:27 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false }  bytes:128015 nreturned:2661 --> 106059ms

So retrieving the first chunk takes 2 seconds, the second 0.1 seconds, and the third 106 seconds! Weird.
I tried changing the batch size, creating indexes on Date and IsClick, and rebooting the machine :P, but nothing helped. Where am I going wrong?
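As a side note, the millisecond bounds in the log above (1283292000000 to 1283378399999) span exactly one calendar day in a UTC+2 zone. A minimal sketch of how such `$gte`/`$lte` bounds can be computed in plain Java (the class name and time zone here are assumptions, not from the original post):

```java
import java.util.Calendar;
import java.util.TimeZone;

public class DayRange {
    // Inclusive [start, end] millisecond bounds of one calendar day,
    // matching the $gte/$lte range used in the query above.
    // Note: assumes a 24-hour day, i.e. no DST transition on that date.
    public static long[] dayBounds(int year, int month, int day, TimeZone tz) {
        Calendar beginning = Calendar.getInstance(tz);
        beginning.clear();
        beginning.set(year, month, day);             // 00:00:00.000 local time
        long start = beginning.getTimeInMillis();
        long end = start + 24L * 60 * 60 * 1000 - 1; // 23:59:59.999 local time
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // Europe/Rome is UTC+2 (CEST) on 2010-09-01, which reproduces the logged bounds.
        long[] b = dayBounds(2010, Calendar.SEPTEMBER, 1, TimeZone.getTimeZone("Europe/Rome"));
        System.out.println(b[0] + " " + b[1]); // 1283292000000 1283378399999
    }
}
```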


Comments (1)

白首有我共你 2024-12-22 15:32:18


There are several factors here that can affect speed, and it will be necessary to gather some extra data to identify the cause.

Some potential issues:

  1. Indexes: are you using the right index? You should probably be indexing on IsClick/Date. That puts the range field second, which is the usual recommendation. Note that this is different from an index on Date/IsClick; the order matters. Try .explain() on your query to see which index is actually being used.
  2. Data size: in some cases slowness is simply caused by too much data, either too many documents or documents that are too large. It can also come from trying to find too many needles in a very large haystack. You are bringing back about 250 KB of data (reslen) and ~12k documents, so this is probably not the problem.
  3. Disk I/O: MongoDB uses memory-mapped files and therefore relies heavily on virtual memory. If you have more data than RAM, fetching some documents requires "going to disk", which can be a very expensive operation. You can spot "going to disk" by monitoring disk activity with tools like iostat or resmon (on Windows).
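The index suggestion in #1 could be sketched with the legacy Java driver used in the question. This is only an illustrative fragment, not a tested implementation: it needs a live mongod to connect to, and the helper name is invented; collection and field names are taken from the question.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class IndexSketch {
    // Hypothetical helper: build the compound index suggested in #1
    // and print the query plan for inspection.
    static void checkPlan(DBCollection coll, long from, long to) {
        // Equality field (IsClick) first, range field (Date) second.
        coll.ensureIndex(new BasicDBObject("IsClick", 1).append("Date", 1));

        DBObject query = new BasicDBObject("IsClick", false)
                .append("Date", new BasicDBObject("$gte", from).append("$lte", to));
        // In the explain output, nscanned should end up close to nreturned
        // once the IsClick_1_Date_1 index is being used.
        System.out.println(coll.find(query, new BasicDBObject("Query", 1)).explain());
    }
}
```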

Based on personal experience, I strongly suspect #3, possibly made worse by #1. I would start by watching the I/O while running the query with .explain(). That should quickly narrow down the range of possible problems.
