Retrieving a large number of records with mongoDB in a reasonable time
I'm using mongoDB to store a query log and get some stats about it.
The objects that I store in mongoDB contain the text of the query, the date,
the user, whether the user clicked on any results, and so on.
Now I'm trying to retrieve all the queries not clicked by a user on a certain day
with Java. My code is approximately this:
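DBObject query = new BasicDBObject();
BasicDBObject keys = new BasicDBObject();
// Projection: return only the "Query" field of each matching document.
keys.put("Query", 1);
// Filter: Date within [beginning, end] (epoch milliseconds) and not clicked.
query.put("Date", new BasicDBObject("$gte", beginning.getTime()).append("$lte", end.getTime()));
query.put("IsClick", false);
...
// Fetch the results in batches of 5000 documents.
DBCursor cur = mongoCollection.find(query, keys).batchSize(5000);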
The query returns about 20k records that I need to iterate over.
The problem is that it takes minutes :( . I don't think that is normal.
From the server log I see:
Wed Nov 16 16:28:40 query db.QueryLogRecordImpl ntoreturn:5000 reslen:252403 nscanned:59260 { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false } nreturned:5000 2055ms
Wed Nov 16 16:28:40 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false } bytes:232421 nreturned:5000 170ms
Wed Nov 16 16:30:27 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false } bytes:128015 nreturned:2661 --> 106059ms
So retrieving the first chunk takes 2 seconds, the second 0.1 seconds, and the third 106 seconds! Weird.
I tried changing the batch size, creating indexes on Date and IsClick, and rebooting the machine :P but nothing helped. Where am I going wrong?
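As a quick cross-check of the figures in the log above, here is a small Java sketch (standard library only) that adds up the nreturned counts and the trailing millisecond values, and measures the gap between two log timestamps. The class name LogStats, its method names, and the abbreviated SAMPLE lines are illustrative assumptions, not part of the original code; only the nreturned and ms fields are copied from the log.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: sums per-batch figures from MongoDB log lines. */
final class LogStats {
    // "nreturned:5000" is the number of documents in one batch.
    private static final Pattern RETURNED = Pattern.compile("nreturned:(\\d+)");
    // The trailing "2055ms" is the server-side time for that operation.
    private static final Pattern MILLIS = Pattern.compile("(\\d+)ms\\s*$");

    // Abbreviated tails of the three log lines quoted in the question.
    static final List<String> SAMPLE = Arrays.asList(
        "query ... nreturned:5000 2055ms",
        "getmore ... nreturned:5000 170ms",
        "getmore ... nreturned:2661 --> 106059ms");

    /** Adds up the first captured number of the pattern over all lines. */
    private static long sum(Pattern pattern, List<String> lines) {
        long total = 0;
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                total += Long.parseLong(m.group(1));
            }
        }
        return total;
    }

    static long totalReturned(List<String> lines) {
        return sum(RETURNED, lines);
    }

    static long totalMillis(List<String> lines) {
        return sum(MILLIS, lines);
    }

    /** Seconds between two HH:mm:ss timestamps on the same day. */
    static long gapSeconds(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).getSeconds();
    }

    public static void main(String[] args) {
        System.out.println("documents returned: " + totalReturned(SAMPLE));
        System.out.println("server time (ms):   " + totalMillis(SAMPLE));
        System.out.println("gap before last getmore (s): " + gapSeconds("16:28:40", "16:30:27"));
    }
}
```

For the three lines above this adds up to 12661 documents and 108284 ms, and the 107 s gap between the last two log timestamps is in line with the 106059 ms reported for the final getmore, which suggests that nearly all of the time goes into serving the last batch.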
Comments (1)
There are several factors here that can affect speed. It will be necessary to gather some extra data to identify the cause.

Some potential issues:

1. Build an index on IsClick/Date. That puts the range second, which is the normal suggestion. Note that this is different from indexing on Date/IsClick; order is important. Try a .explain() on your query to see which indexes are being used.
2. [...] (reslen) and 12k documents, so this is probably not the problem.
3. [...] iostat or resmon (Windows) to monitor the disk activity.

Based on personal experience, I strongly suspect #3, with a possible exacerbation from #1. I would start by watching the IO while running a .explain() query. This should quickly narrow down the range of possible problems.
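To illustrate why the field order in a compound index matters, here is a simplified Java model (standard library only, no MongoDB involved). It treats an index as a sorted list of (IsClick, Date) entries and counts how many entries a contiguous range scan touches for each ordering. The class IndexOrderDemo, the synthetic data (10 "days" of 100 records, a quarter of them unclicked), and the scan logic are all assumptions made for illustration; a real MongoDB B-tree index is more sophisticated than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of a two-field index as a sorted list of entries. */
final class IndexOrderDemo {
    static final class Entry {
        final boolean isClick;
        final long date;
        Entry(boolean isClick, long date) {
            this.isClick = isClick;
            this.date = date;
        }
    }

    // Index on IsClick/Date: equality field first, range field second.
    static final Comparator<Entry> CLICK_THEN_DATE = (a, b) -> {
        int c = Boolean.compare(a.isClick, b.isClick);
        return c != 0 ? c : Long.compare(a.date, b.date);
    };

    // Index on Date/IsClick: range field first.
    static final Comparator<Entry> DATE_THEN_CLICK = (a, b) -> {
        int c = Long.compare(a.date, b.date);
        return c != 0 ? c : Boolean.compare(a.isClick, b.isClick);
    };

    /** Synthetic data: 10 days x 100 records, every fourth record unclicked. */
    static List<Entry> sampleEntries() {
        List<Entry> entries = new ArrayList<>();
        for (int day = 0; day < 10; day++) {
            for (int i = 0; i < 100; i++) {
                entries.add(new Entry(i % 4 != 0, day * 1000L + i));
            }
        }
        return entries;
    }

    /**
     * Walks the index from lo to hi (inclusive, in index order) and returns
     * {scanned, returned} for the filter IsClick == false, begin <= Date <= end.
     */
    static int[] scan(List<Entry> entries, Comparator<Entry> order,
                      Entry lo, Entry hi, long begin, long end) {
        List<Entry> index = new ArrayList<>(entries);
        index.sort(order);
        int scanned = 0;
        int returned = 0;
        for (Entry e : index) {
            if (order.compare(e, lo) < 0) continue; // a real index would seek here
            if (order.compare(e, hi) > 0) break;    // past the end of the range
            scanned++;
            if (!e.isClick && e.date >= begin && e.date <= end) returned++;
        }
        return new int[] {scanned, returned};
    }

    public static void main(String[] args) {
        List<Entry> data = sampleEntries();
        long begin = 3000, end = 3999; // "day 3" in the synthetic data
        int[] clickFirst = scan(data, CLICK_THEN_DATE,
                new Entry(false, begin), new Entry(false, end), begin, end);
        int[] dateFirst = scan(data, DATE_THEN_CLICK,
                new Entry(false, begin), new Entry(true, end), begin, end);
        System.out.println("IsClick/Date: scanned=" + clickFirst[0] + " returned=" + clickFirst[1]);
        System.out.println("Date/IsClick: scanned=" + dateFirst[0] + " returned=" + dateFirst[1]);
    }
}
```

In this model the IsClick/Date ordering scans only the entries it returns, while Date/IsClick scans every entry in the date range, clicked or not. With the legacy Java driver used in the question, the corresponding index could presumably be created with mongoCollection.createIndex(new BasicDBObject("IsClick", 1).append("Date", 1)), and cur.explain() on the DBCursor should report which index the query actually uses.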