Which strategy to use for multiprocessing in Python
I am completely new to multiprocessing. I have been reading the documentation for the multiprocessing module. I read about Pool, Threads, Queues, etc., but I am completely lost.
What I want to do with multiprocessing is convert my humble HTTP downloader to work with multiple workers. What I am doing at the moment is: download a page, parse the page to get the interesting links, and continue until all interesting links are downloaded. Now I want to implement this with multiprocessing, but I have no idea at the moment how to organize this workflow. I have had two thoughts about it. Firstly, I thought about having two queues: one queue for links that need to be downloaded, the other for links to be parsed. One worker downloads the pages and adds them to the queue of items that need to be parsed, and another process parses a page and adds the links it finds interesting to the other queue. The problems I expect from this approach are: first of all, why download only one page at a time and parse only one page at a time? Moreover, how does a process know that there are items to be added to the queue later, after it has exhausted all the items currently in the queue?
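To make that first idea concrete, the two-queue pipeline I have in mind would look roughly like this (download_worker, parse_worker and is_interesting are just placeholder names, and the download/parse bodies are stubs; this is only a sketch of the idea, not tested code):

import multiprocessing

def is_interesting(link):
    # Stub: decide whether a link is worth following.
    return True

def download_worker(download_queue, parse_queue):
    # Pulls urls to download and hands the fetched pages to the parser.
    while True:
        url = download_queue.get()
        page = '<html>...</html>'          # stub for the actual HTTP download
        parse_queue.put((url, page))

def parse_worker(parse_queue, download_queue):
    # Pulls downloaded pages, extracts links, feeds new urls back to the downloader.
    while True:
        url, page = parse_queue.get()
        links = []                          # stub for the actual link extraction
        for link in links:
            if is_interesting(link):
                download_queue.put(link)

if __name__ == '__main__':
    download_queue = multiprocessing.Queue()
    parse_queue = multiprocessing.Queue()
    download_queue.put('http://example.com')

    multiprocessing.Process(target=download_worker,
                            args=(download_queue, parse_queue)).start()
    multiprocessing.Process(target=parse_worker,
                            args=(parse_queue, download_queue)).start()
    # Open problem (this is exactly what I am asking): how do the workers ever
    # learn that the queues are empty for good, and how would I scale this to
    # more than one downloader and one parser?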
The other approach I thought about is this: have a function that can be called with a url as an argument. This function downloads the document and starts parsing it for links. Every time it encounters an interesting link, it immediately creates a new thread running the same function as itself. The problems I have with this approach are: how do I keep track of all the processes spawned all over the place, how do I know whether there are still processes running, and also, how do I limit the maximum number of processes?
So I am completely lost. Can anyone suggest a good strategy, and perhaps show some example code for how to go about implementing it?
2 Answers
Here is one approach, using multiprocessing. (Many thanks to @Voo, for suggesting many improvements to the code).
- The urls to be processed are put in a JoinableQueue, called url_queue.
- Each worker takes a url from the url_queue, finds new urls and adds them to the url_queue.
- When a worker has finished with a url, it calls url_queue.task_done().
- The main process calls url_queue.join(). This blocks the main process until task_done has been called for every task in the url_queue.
- Since the worker processes have their daemon attribute set to True, they too end when the main process ends.
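A minimal sketch of these steps, assuming a small fixed pool of workers and with find_links standing in for the download-and-parse step (the names and the worker count are illustrative, not part of the original code), could look like this:

import multiprocessing

NUM_WORKERS = 4  # illustrative degree of parallelism

def find_links(url):
    # Stand-in for "download the page and return the interesting links on it".
    return []

def worker(url_queue, seen):
    while True:
        url = url_queue.get()              # blocks until a url is available
        try:
            for link in find_links(url):
                if link not in seen:       # best-effort check; not atomic across workers
                    seen.append(link)
                    url_queue.put(link)
        finally:
            url_queue.task_done()          # exactly one task_done() per get()

if __name__ == '__main__':
    url_queue = multiprocessing.JoinableQueue()
    manager = multiprocessing.Manager()
    seen = manager.list()                  # urls already queued, shared between workers

    start_url = 'http://example.com'       # illustrative starting point
    seen.append(start_url)
    url_queue.put(start_url)

    for _ in range(NUM_WORKERS):
        p = multiprocessing.Process(target=worker, args=(url_queue, seen))
        p.daemon = True                    # workers end when the main process ends
        p.start()

    url_queue.join()                       # returns once task_done() has matched every put()
    print('Crawled %d urls' % len(seen))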
All the components used in this example are also explained in Doug Hellmann's excellent Python Module of the Week tutorial on multiprocessing.
What you're describing is essentially graph traversal; most graph traversal algorithms (those more sophisticated than depth-first) keep track of two sets of nodes. In your case, the nodes are urls.
The first set is called the "closed set", and represents all of the nodes that have already been visited and processed. If, while you're processing a page, you find a link that happens to be in the closed set, you can ignore it; it's already been handled.
The second set is unsurprisingly called the "open set", and includes all of the edges that have been found, but not yet processed.
The basic mechanism is to start by putting the root node into the open set (the closed set is initially empty, since no nodes have been processed yet) and start working. Each worker takes a single node from the open set, copies it to the closed set, processes the node, and adds any nodes it discovers back to the open set (so long as they aren't already in either the open or closed set). Once the open set is empty (and no workers are still processing nodes), the graph has been completely traversed.
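In plain, single-process form the bookkeeping is just this (fetch_and_parse is a placeholder for the download-and-parse step; the sketch is only meant to make the two sets concrete):

def crawl(root_url, fetch_and_parse):
    open_set = {root_url}    # found, but not yet processed
    closed_set = set()       # already visited and processed

    while open_set:
        url = open_set.pop()
        closed_set.add(url)                   # mark as handled
        for link in fetch_and_parse(url):     # the interesting links on that page
            if link not in open_set and link not in closed_set:
                open_set.add(link)
    return closed_set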
Actually implementing this in multiprocessing
probably means that you'll have a master task that keeps track of the open and closed sets; If a worker in a worker pool indicates that it is ready for work, the master worker takes care of moving the node from the open set to the closed set and starting up the worker. the workers can then pass all of the nodes they find, without worrying about if they are already closed, back to the master; and the master will ignore nodes that are already closed.