Pausing a Python generator
I have a Python generator that does work producing a large amount of data, which uses up a lot of RAM. Is there a way of detecting whether the processed data has been "consumed" by the code that is using the generator, and if so, pausing until it is consumed?
from functools import partial   # partial() is used below; web, pool and grab are from the asker's own scraping code (not shown)

def multi_grab(urls, proxy=None, ref=None, xpath=False, compress=True, delay=10, pool_size=50, retries=1, http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy, delay=delay)
        pool_size = len(proxy.records)  # pool_size is an int, so the original len(pool_size.records) would raise; size the pool to the proxy list instead
    work_pool = pool.Pool(pool_size)    # gevent-style worker pool
    partial_grab = partial(grab, proxy=proxy, post=None, ref=ref, xpath=xpath, compress=compress, include_url=True, retries=retries, http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab, urls):
        if result:
            yield result
run from:
if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com', xpath=True).xpath('//a/@href')
                if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links, pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1
4 Answers
A generator simply yields values. There's no way for the generator to know what's being done with them.
But the generator also pauses constantly, as the caller does whatever it does. It doesn't execute again until the caller invokes it to get the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
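A minimal sketch of that behaviour (slow_source and the prints are made up for illustration, not from the question's code):

def slow_source():
    # the body only runs when the caller asks for the next value
    for i in range(3):
        print 'producing', i
        yield i

gen = slow_source()             # nothing has executed yet
for item in gen:                # each iteration resumes the generator once
    print 'consuming', item     # while this runs, the generator stays paused

The output interleaves 'producing' and 'consuming' lines, because the generator never runs ahead of the caller.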
The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra ram) is when the objects are being referenced somewhere else (such as adding them to a list). Make sure you aren't saving these variables unnecessarily.
If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.
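A rough sketch of that idea with the standard library (fetch and process are hypothetical placeholders for the real work); Queue(maxsize=...) is what makes the producer block until the consumer catches up:

from Queue import Queue          # named 'queue' in Python 3
from threading import Thread

def producer(urls, q):
    for url in urls:
        q.put(fetch(url))        # fetch() is a placeholder; put() blocks once the queue is full
    q.put(None)                  # sentinel marking the end of the work

def consume(urls):
    q = Queue(maxsize=10)        # never more than 10 unprocessed results in memory
    Thread(target=producer, args=(urls, q)).start()
    for item in iter(q.get, None):   # stops at the sentinel
        process(item)            # process() is a placeholder for the consumer-side work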
I think you may be looking for yield. It's explained in another StackOverflow question: What does the "yield" keyword do in Python?

A solution could be to use a Queue to which the generator would add data, while another part of the code would get data from it and process it. This way you could ensure that there are no more than n items in memory at the same time.
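Since multi_grab already drives a gevent-style pool, the same bounded-queue idea could look roughly like this with gevent primitives (bounded_results, worker, producer and the sentinel are made up for illustration; grab is the asker's own fetch function):

import gevent
from gevent.pool import Pool
from gevent.queue import Queue

def bounded_results(urls, n=10):
    results = Queue(maxsize=n)       # at most n finished pages wait in the queue
    workers = Pool(n)                # at most n downloads run at once
    done = object()                  # sentinel marking the end of the work

    def worker(url):
        results.put(grab(url))       # put() blocks while the queue is full

    def producer():
        for url in urls:
            workers.spawn(worker, url)   # blocks while all n worker slots are busy
        workers.join()
        results.put(done)

    gevent.spawn(producer)
    for item in iter(results.get, done):
        yield item                   # hand results to the caller one at a time

With this arrangement roughly 2*n results can exist at once: n in flight in the workers plus n waiting in the queue.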