Which strategy to use for multiprocessing in Python
I am completely new to multiprocessing. I have been reading the documentation for the multiprocessing module. I read about Pool, Threads, Queues, etc., but I am completely lost.
What I want to do with multiprocessing is convert my humble HTTP downloader to work with multiple workers. What I am doing at the moment is: download a page, parse the page to get the interesting links, and continue until all interesting links are downloaded. Now I want to implement this with multiprocessing, but I have no idea at the moment how to organize this workflow. I have had two thoughts about it. Firstly, I thought about having two queues: one queue for links that need to be downloaded, the other for links to be parsed. One worker downloads the pages and adds them to the queue of items that need to be parsed, and another process parses a page and adds the links it finds interesting to the other queue. The problems I expect from this approach are: first of all, why download only one page at a time and parse only one page at a time? Moreover, how does a process know that there are items to be added to the queue later, after it has exhausted all the items currently in the queue?
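To make that first idea concrete, the two-queue pipeline I have in mind would look roughly like this (download_worker, parse_worker and is_interesting are just placeholder names, and the download/parse bodies are stubs; this is only a sketch of the idea, not tested code):

import multiprocessing

def is_interesting(link):
    # Stub: decide whether a link is worth following.
    return True

def download_worker(download_queue, parse_queue):
    # Pulls urls to download and hands the fetched pages to the parser.
    while True:
        url = download_queue.get()
        page = '<html>...</html>'          # stub for the actual HTTP download
        parse_queue.put((url, page))

def parse_worker(parse_queue, download_queue):
    # Pulls downloaded pages, extracts links, feeds new urls back to the downloader.
    while True:
        url, page = parse_queue.get()
        links = []                          # stub for the actual link extraction
        for link in links:
            if is_interesting(link):
                download_queue.put(link)

if __name__ == '__main__':
    download_queue = multiprocessing.Queue()
    parse_queue = multiprocessing.Queue()
    download_queue.put('http://example.com')

    multiprocessing.Process(target=download_worker,
                            args=(download_queue, parse_queue)).start()
    multiprocessing.Process(target=parse_worker,
                            args=(parse_queue, download_queue)).start()
    # Open problem (this is exactly what I am asking): how do the workers ever
    # learn that the queues are empty for good, and how would I scale this to
    # more than one downloader and one parser?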
The other approach I thought about is this: have a function that can be called with a url as an argument. This function downloads the document and starts parsing it for links. Every time it encounters an interesting link, it immediately creates a new thread running the same function as itself. The problems I have with this approach are: how do I keep track of all the processes spawned all over the place, how do I know whether there are still processes running, and also, how do I limit the maximum number of processes?
So I am completely lost. Can anyone suggest a good strategy, and perhaps show some example code for how to go about implementing it?
2 Answers
Here is one approach, using multiprocessing. (Many thanks to @Voo, for suggesting many improvements to the code).
- The urls to be processed are put in a JoinableQueue, called url_queue.
- Each worker takes a url from the url_queue, finds new urls and adds them to the url_queue.
- When a worker has finished with a url, it calls url_queue.task_done().
- The main process calls url_queue.join(). This blocks the main process until task_done has been called for every task in the url_queue.
- Since the worker processes have their daemon attribute set to True, they too end when the main process ends.
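A minimal sketch of these steps, assuming a small fixed pool of workers and with find_links standing in for the download-and-parse step (the names and the worker count are illustrative, not part of the original code), could look like this:

import multiprocessing

NUM_WORKERS = 4  # illustrative degree of parallelism

def find_links(url):
    # Stand-in for "download the page and return the interesting links on it".
    return []

def worker(url_queue, seen):
    while True:
        url = url_queue.get()              # blocks until a url is available
        try:
            for link in find_links(url):
                if link not in seen:       # best-effort check; not atomic across workers
                    seen.append(link)
                    url_queue.put(link)
        finally:
            url_queue.task_done()          # exactly one task_done() per get()

if __name__ == '__main__':
    url_queue = multiprocessing.JoinableQueue()
    manager = multiprocessing.Manager()
    seen = manager.list()                  # urls already queued, shared between workers

    start_url = 'http://example.com'       # illustrative starting point
    seen.append(start_url)
    url_queue.put(start_url)

    for _ in range(NUM_WORKERS):
        p = multiprocessing.Process(target=worker, args=(url_queue, seen))
        p.daemon = True                    # workers end when the main process ends
        p.start()

    url_queue.join()                       # returns once task_done() has matched every put()
    print('Crawled %d urls' % len(seen))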
All the components used in this example are also explained in Doug Hellmann's excellent Python Module of the Week tutorial on multiprocessing.
What you're describing is essentially graph traversal; most graph traversal algorithms (those more sophisticated than depth-first) keep track of two sets of nodes. In your case, the nodes are urls.
The first set is called the "closed set", and represents all of the nodes that have already been visited and processed. If, while you're processing a page, you find a link that happens to be in the closed set, you can ignore it; it's already been handled.
The second set is unsurprisingly called the "open set", and includes all of the edges that have been found, but not yet processed.
The basic mechanism is to start by putting the root node into the open set (the closed set is initially empty, since no nodes have been processed yet) and start working. Each worker takes a single node from the open set, copies it to the closed set, processes the node, and adds any nodes it discovers back to the open set (so long as they aren't already in either the open or closed set). Once the open set is empty (and no workers are still processing nodes), the graph has been completely traversed.
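In plain, single-process form the bookkeeping is just this (fetch_and_parse is a placeholder for the download-and-parse step; the sketch is only meant to make the two sets concrete):

def crawl(root_url, fetch_and_parse):
    open_set = {root_url}    # found, but not yet processed
    closed_set = set()       # already visited and processed

    while open_set:
        url = open_set.pop()
        closed_set.add(url)                   # mark as handled
        for link in fetch_and_parse(url):     # the interesting links on that page
            if link not in open_set and link not in closed_set:
                open_set.add(link)
    return closed_set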
Actually implementing this in multiprocessing
probably means that you'll have a master task that keeps track of the open and closed sets; If a worker in a worker pool indicates that it is ready for work, the master worker takes care of moving the node from the open set to the closed set and starting up the worker. the workers can then pass all of the nodes they find, without worrying about if they are already closed, back to the master; and the master will ignore nodes that are already closed.