PyMongo -- Cursor Iteration

I've recently started testing MongoDB via shell and via PyMongo. I've noticed that returning a cursor and trying to iterate over it seems to bottleneck in the actual iteration. Is there a way to return more than one document during iteration?

Pseudo code:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for entry in cursor:
        ...  # deal with a single entry each time

What I'm hoping to do is something like this:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for all_entries in cursor:
        ...  # deal with all entries at once rather than iterating one at a time

I've tried using batch_size() as per this question and changing the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).
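
(For reference, batch_size() is typically chained onto the cursor like this; a minimal sketch reusing the collection, field and value placeholders from above:)

# Minimal sketch: batch_size() only controls how many documents each network
# round trip returns; iteration still yields one document at a time in Python.
cursor = collection.find({"field": value}).batch_size(1000)
for entry in cursor:
    ...  # same per-document processing as before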

Any help is greatly appreciated. Please be easy on this Mongo newbie!

--- EDIT ---

Thank you Caleb. I think you've pointed out what I was really trying to ask, which is this: is there any way to do a sort-of collection.findAll() or maybe cursor.fetchAll() command, as there is with the cx_Oracle module? The problem isn't storing the data, but retrieving it from the Mongo DB as fast as possible.

As far as I can tell, the speed at which the data is returned to me is dictated by my network since Mongo has to single-fetch each record, correct?
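
(For comparison, the closest PyMongo analogue to cx_Oracle's fetchall() is simply exhausting the cursor into a list; a minimal sketch, with the helper name invented purely for illustration:)

def fetch_all(collection, field, value):
    # Hypothetical helper (name made up for illustration): list() exhausts the
    # cursor, pulling every matching document over the network into memory.
    return list(collection.find({field: value}))

docs = fetch_all(collection, "field", value)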

Comments (4)

陌伤ぢ 2024-11-27 00:10:16

Have you considered an approach like:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    entries = list(cursor)  # or pull them out with a loop or comprehension -- just get all the docs
    # then process entries as a list, either singly or in batch

Alternately, something like:

# same loop start, but collect the results into a dict keyed by value
    entries[value] = list(cursor)
# after the loop, all the cursors are out of scope and closed
for value in entries:
    ...  # process entries[value], either singly or in batch

Basically, as long as you have enough RAM to store your result sets, you should be able to pull them off the cursors and hold onto them before processing. This isn't likely to be significantly faster, but it will mitigate any slowdown that is specific to the cursors, and it frees you to process your data in parallel if you're set up for that.
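
(A minimal sketch of the parallel processing mentioned above, using only the standard library; process_doc is a hypothetical per-document function, and collection and value are the same placeholders as before:)

from concurrent.futures import ThreadPoolExecutor

def process_doc(doc):
    ...  # hypothetical per-document work; threads suit I/O-bound work, use ProcessPoolExecutor for CPU-bound work

entries = list(collection.find({"field": value}))  # materialize the result set first, as above
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_doc, entries))  # consume the iterator so any exceptions surface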

不气馁 2024-11-27 00:10:16

You could also try:

results = list(collection.find({'field':value}))

That should load everything right into RAM.

Or this perhaps, if your file is not too huge:

values = list()
for line in file:
    values.append(line[a:b])
results = list(collection.find({'field': {'$in': values}}))
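
(One caveat with the $in form: the matches for all the values come back as a single mixed stream, so if you need them grouped per value you have to regroup on the client; a minimal sketch reusing the values list built above:)

from collections import defaultdict

# Regroup the mixed result stream by the value of "field" (same placeholder field as above).
grouped = defaultdict(list)
for doc in collection.find({'field': {'$in': values}}):
    grouped[doc['field']].append(doc)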

残月青衣踏尘吟 2024-11-27 00:10:16

toArray() might be a solution.
Based on the docs, it first iterates over the whole cursor on the Mongo side and returns all of the results at once, as an array.

http://docs.mongodb.org/manual/reference/method/cursor.toArray/

This is unlike list(coll.find()) or [doc for doc in coll.find()], which yield one document to Python at a time and go back to Mongo for the next batch as the cursor is iterated.

However, this method is not implemented in PyMongo... strange.
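
(In PyMongo the closest client-side equivalent is exhausting the cursor yourself, optionally with a large batch_size() to cut down on network round trips; a minimal sketch using the same coll placeholder as above:)

# Approximates toArray() on the client: drain the whole result set into a list.
# A larger batch_size reduces the number of round trips needed to exhaust the cursor.
docs = list(coll.find({"field": value}).batch_size(1000))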

愁以何悠 2024-11-27 00:10:16

As @jmelesky mentioned above, I always follow the same kind of method. Here is my sample code: to hold the documents from my cursor twts_result, I copy them into the list declared below. Make use of RAM to store the data if you can. This also solves the cursor timeout problem, provided you don't need further processing or updates against the collection you fetched the data from.

Here I am fetching tweets from a collection.

twts_result = maindb.economy_geolocation.find({}, {'_id': False})

tweets_sentiment = []
batch_tweets = []
# Copy the cursor data into a list (this exhausts the cursor)
tweets_collection = list(twts_result)
print("Tweets for processing -> %d" % len(tweets_collection))  # len() of the list; Cursor.count() is deprecated
for twt in tweets_collection:
    ...  # do stuff here with the **twt** data
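
(If the result set is too large to copy into RAM, another way to avoid the cursor timeout mentioned above is to keep the server-side cursor alive explicitly; a minimal sketch using PyMongo's no_cursor_timeout option with the same collection as above:)

# Ask the server not to time out the idle cursor, and close it explicitly when done.
cursor = maindb.economy_geolocation.find({}, {'_id': False}, no_cursor_timeout=True)
try:
    for twt in cursor:
        ...  # process each document as it streams in
finally:
    cursor.close()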