Pausing a Python generator
I have a Python generator that does work producing a large amount of data, which uses up a lot of RAM. Is there a way of detecting whether the processed data has been "consumed" by the code that is using the generator, and if so, pausing until it is consumed?
from functools import partial   # partial() is used below; web, pool and grab are from the asker's own scraping code (not shown)

def multi_grab(urls, proxy=None, ref=None, xpath=False, compress=True, delay=10, pool_size=50, retries=1, http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy, delay=delay)
        pool_size = len(proxy.records)  # pool_size is an int, so the original len(pool_size.records) would raise; size the pool to the proxy list instead
    work_pool = pool.Pool(pool_size)    # gevent-style worker pool
    partial_grab = partial(grab, proxy=proxy, post=None, ref=ref, xpath=xpath, compress=compress, include_url=True, retries=retries, http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab, urls):
        if result:
            yield result
run from:
if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com', xpath=True).xpath('//a/@href')
                if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links, pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1
4 Answers
A generator simply yields values. There's no way for the generator to know what's being done with them.
But the generator also pauses constantly, as the caller does whatever it does. It doesn't execute again until the caller invokes it to get the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
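A minimal sketch of that behaviour (slow_source and the prints are made up for illustration, not from the question's code):

def slow_source():
    # the body only runs when the caller asks for the next value
    for i in range(3):
        print 'producing', i
        yield i

gen = slow_source()             # nothing has executed yet
for item in gen:                # each iteration resumes the generator once
    print 'consuming', item     # while this runs, the generator stays paused

The output interleaves 'producing' and 'consuming' lines, because the generator never runs ahead of the caller.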
The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra ram) is when the objects are being referenced somewhere else (such as adding them to a list). Make sure you aren't saving these variables unnecessarily.
If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.
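A rough sketch of that idea with the standard library (fetch and process are hypothetical placeholders for the real work); Queue(maxsize=...) is what makes the producer block until the consumer catches up:

from Queue import Queue          # named 'queue' in Python 3
from threading import Thread

def producer(urls, q):
    for url in urls:
        q.put(fetch(url))        # fetch() is a placeholder; put() blocks once the queue is full
    q.put(None)                  # sentinel marking the end of the work

def consume(urls):
    q = Queue(maxsize=10)        # never more than 10 unprocessed results in memory
    Thread(target=producer, args=(urls, q)).start()
    for item in iter(q.get, None):   # stops at the sentinel
        process(item)            # process() is a placeholder for the consumer-side work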
I think you may be looking for yield. It's explained in another StackOverflow question: What does the "yield" keyword do in Python?

A solution could be to use a Queue to which the generator would add data, while another part of the code would get data from it and process it. This way you could ensure that there are no more than n items in memory at the same time.
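Since multi_grab already drives a gevent-style pool, the same bounded-queue idea could look roughly like this with gevent primitives (bounded_results, worker, producer and the sentinel are made up for illustration; grab is the asker's own fetch function):

import gevent
from gevent.pool import Pool
from gevent.queue import Queue

def bounded_results(urls, n=10):
    results = Queue(maxsize=n)       # at most n finished pages wait in the queue
    workers = Pool(n)                # at most n downloads run at once
    done = object()                  # sentinel marking the end of the work

    def worker(url):
        results.put(grab(url))       # put() blocks while the queue is full

    def producer():
        for url in urls:
            workers.spawn(worker, url)   # blocks while all n worker slots are busy
        workers.join()
        results.put(done)

    gevent.spawn(producer)
    for item in iter(results.get, done):
        yield item                   # hand results to the caller one at a time

With this arrangement roughly 2*n results can exist at once: n in flight in the workers plus n waiting in the queue.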