PyMongo: Cursor Iteration
I've recently started testing MongoDB via shell and via PyMongo. I've noticed that returning a cursor and trying to iterate over it seems to bottleneck in the actual iteration. Is there a way to return more than one document during iteration?
Pseudo code:
    for line in file:
        value = line[a:b]
        cursor = collection.find({"field": value})
        for entry in cursor:
            (deal with single entry each time)
What I'm hoping to do is something like this:
    for line in file:
        value = line[a:b]
        cursor = collection.find({"field": value})
        for all_entries in cursor:
            (deal with all entries at once rather than iterate each time)
I've tried using batch_size() as per this question and changing the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).
Any help is greatly appreciated. Please be easy on this Mongo newbie!
--- EDIT ---
Thank you Caleb. I think you've pointed out what I was really trying to ask, which is this: is there any way to do a sort-of collection.findAll() or maybe cursor.fetchAll() command, as there is with the cx_Oracle module? The problem isn't storing the data, but retrieving it from the Mongo DB as fast as possible.
As far as I can tell, the speed at which the data is returned to me is dictated by my network since Mongo has to single-fetch each record, correct?
4 Answers
Have you considered an approach like:
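The code block for this approach did not survive in this copy of the post. A sketch of the idea, using a tiny stand-in for the collection so the snippet runs without a server; with PyMongo you would call `list(collection.find({"field": value}))` directly:

```python
# Stand-in for a PyMongo collection: find() yields matching documents
# lazily, just as a real cursor would.
class FakeCollection:
    def __init__(self, docs):
        self.docs = docs

    def find(self, query):
        return (d for d in self.docs
                if all(d.get(k) == v for k, v in query.items()))

collection = FakeCollection([
    {"field": "abc", "n": 1},
    {"field": "abc", "n": 2},
    {"field": "xyz", "n": 3},
])

lines = ["abc line one", "xyz line two"]
a, b = 0, 3  # slice bounds, as in the question's pseudocode

for line in lines:
    value = line[a:b]
    # Drain the whole cursor into a list first, then work from RAM.
    entries = list(collection.find({"field": value}))
    for entry in entries:
        pass  # deal with all entries, now held in memory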
Alternately, something like:
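A sketch of the alternate approach: gather every result set up front, keyed by the looked-up value, so all fetching happens before any processing. The `find` here is a stand-in so the snippet runs without a server; with PyMongo it would be `collection.find({"field": value})`.

```python
def find(value):
    # Stand-in query function so the sketch runs without a server.
    docs = [{"field": "abc", "n": 1},
            {"field": "abc", "n": 2},
            {"field": "xyz", "n": 3}]
    return (d for d in docs if d["field"] == value)

lines = ["abc line one", "xyz line two"]
a, b = 0, 3

# One dict entry per value: every fetch happens here, up front...
results = {line[a:b]: list(find(line[a:b])) for line in lines}

# ...and processing can then run over in-memory lists (even in parallel).
for value, entries in results.items():
    pass  # deal with the entries for this value
```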
Basically, as long as you have RAM enough to store your result sets, you should be able to pull them off the cursors and hold onto them before processing. This isn't likely to be significantly faster, but it will mitigate any slowdown specifically of the cursors, and free you to process your data in parallel if you're set up for that.
You could also try:
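The snippet this refers to is missing from this copy of the post; presumably it was along these lines. It is sketched as a helper running against a stand-in collection so it executes without a server; with a real PyMongo collection the body is the same one-liner.

```python
def load_all(collection, value):
    # list() drains the cursor, so the whole result set lands in RAM at once.
    return list(collection.find({"field": value}))

# Minimal stand-in for a collection, so the sketch runs without a server.
class FakeCollection:
    def find(self, query):
        docs = [{"field": "x", "i": 0},
                {"field": "y", "i": 1},
                {"field": "x", "i": 2}]
        return (d for d in docs if d["field"] == query["field"])

results = load_all(FakeCollection(), "x")
print(len(results))  # 2
```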
That should load everything right into RAM.
Or this perhaps, if your file is not too huge:
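The missing snippet here presumably collected all the values from the file first and then issued a single `$in` query instead of one query per line. A sketch of that query construction; the final `collection.find` call is left as a comment since it needs a live server:

```python
lines = ["abc rest of line", "xyz rest of line", "abc seen again"]
a, b = 0, 3

# Collect every value from the file first (a set drops duplicates)...
values = sorted({line[a:b] for line in lines})

# ...then a single $in query replaces one round trip per line.
query = {"field": {"$in": values}}
# With PyMongo: results = list(collection.find(query))
print(query)  # {'field': {'$in': ['abc', 'xyz']}}
```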
toArray() might be a solution. Based on the docs, it first iterates the entire cursor on the Mongo side and returns the results only once, in the form of an array.

http://docs.mongodb.org/manual/reference/method/cursor.toArray/

This is unlike list(coll.find()) or [doc for doc in coll.find()], which fetch one document into Python at a time and go back to Mongo to fetch the next. However, this method is not implemented in PyMongo... strange.
As @jmelesky mentioned above, I always follow the same kind of approach. Here is my sample code: to store my cursor twts_result, I declare a list and copy the documents into it. Make use of RAM to store the data if you can. This also solves the cursor timeout problem, provided no processing or updates are needed on the collection you fetched the data from.

Here I am fetching tweets from a collection.
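The sample code itself was lost from this copy of the post; the pattern described is roughly the following. The `snapshot` helper and the stand-in cursor are illustrative additions, not the original code; with PyMongo it would be `tweets = list(db["tweets"].find())` immediately after the query.

```python
def snapshot(cursor):
    # Drain the cursor into a plain list right after the query, so slow
    # downstream processing cannot run into a server-side cursor timeout.
    return list(cursor)

# Stand-in cursor (any iterable of documents), so the sketch runs anywhere.
twts_result = iter([{"text": "first tweet"}, {"text": "second tweet"}])
tweets = snapshot(twts_result)
print(len(tweets))  # 2
```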