How do I manage thread shutdown in a multithreaded crawler?
Let's say I'm writing a multithreaded web crawler. Threads get a job (for example, in the form of a URL) from a queue, do some work, and then might add some new jobs to the queue. Sounds simple enough, but I'm not sure how to handle the situation where all the jobs are done. Let's say there are currently 0 jobs in the queue, and some thread is trying to get a new job. At this point two situations are possible:
- Some other threads are still working and might produce new jobs for this thread to pick up. In this case, the thread can simply wait for a new job (with a blocking .pop(), if the queue supports it, or by sleeping and waking up from time to time to check whether a job is available).
- All other threads are also waiting for a job. In this case, no new jobs can ever be produced, so the threads must be terminated.
One solution I can think of is to keep an integer (behind a mutex) that counts the "busy" threads: it is incremented when a thread gets a job and decremented once the thread has finished processing it. This way, if there are 0 jobs and 0 busy threads, a thread can safely terminate. However, I'm not sure this is the best possible solution. Are there any other options for handling such a situation?
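To make the counter idea concrete, here is a minimal sketch in Python, assuming a condition variable guards both the job queue and the busy count. The names `JobPool`, `run_crawler`, and `fetch_links` are hypothetical and only for illustration. The detail that seems to matter is that a finished job's new URLs are enqueued and the counter is decremented under the same lock, so a waiting thread never sees "0 jobs and 0 busy" while new jobs are still about to arrive.

```python
import threading
from collections import deque


class JobPool:
    """Hypothetical job queue that also counts busy workers."""

    def __init__(self, initial_jobs):
        self._jobs = deque(initial_jobs)
        self._busy = 0                      # workers currently processing a job
        self._cond = threading.Condition()  # guards _jobs and _busy together

    def get(self):
        """Block until a job is available; return None when none can ever appear."""
        with self._cond:
            while not self._jobs:
                if self._busy == 0:
                    # Empty queue and nobody working: no job can be produced.
                    self._cond.notify_all()
                    return None
                self._cond.wait()
            self._busy += 1
            return self._jobs.popleft()

    def done(self, new_jobs=()):
        """Enqueue jobs produced by the finished job, then mark the worker idle."""
        with self._cond:
            self._jobs.extend(new_jobs)
            self._busy -= 1
            self._cond.notify_all()  # wake waiters: new jobs or possible shutdown


def run_crawler(start, fetch_links, num_threads=4):
    """Crawl from `start`; `fetch_links(url)` is an assumed callable returning URLs."""
    seen = {start}
    seen_lock = threading.Lock()
    pool = JobPool([start])

    def worker():
        while True:
            url = pool.get()
            if url is None:
                return  # all work is done, terminate this thread
            new = []
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link not in seen:
                            seen.add(link)
                            new.append(link)
            finally:
                pool.done(new)  # always decrement, even if fetch_links raised

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

As one alternative, Python's standard `queue.Queue` keeps a similar count of unfinished tasks internally: workers call `task_done()` after each job, the main thread calls `join()` to wait until the count reaches zero, and it can then push one sentinel value per worker to stop them.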