Python process blocked by urllib2

Posted on 2024-08-18 18:54:09

I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the whole thing hangs.

import urllib2, multiprocessing
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process

def download(url):
    """Download a page from an url.
    url [str]: url to get.
    return [unicode]: page downloaded.
    """
    if settings.DEBUG:  # 'settings' comes from the project's configuration module (not shown here)
        print u'Downloading %s' % url
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    encoding = response.headers['content-type'].split('charset=')[-1]
    content = unicode(response.read(), encoding)
    return content

def downloader(url_queue, page_queue):
    def _downloader(url_queue, page_queue):
        # Worker loop: take a URL from the internal queue, download the page
        # and push the result onto the internal page queue.
        while True:
            try:
                url = url_queue.get()
                page_queue.put_nowait({'url': url, 'page': download(url)})
            except Exception, err:
                print u'Error downloading %s' % url
                raise err
            finally:
                url_queue.task_done()

    ## Init internal workers
    internal_url_queue = Queue()
    internal_page_queue = Queue()
    for num in range(multiprocessing.cpu_count()):
        worker = Thread(target=_downloader, args=(internal_url_queue, internal_page_queue))
        worker.setDaemon(True)
        worker.start()

    # Forward incoming URLs to the worker threads until the 'STOP' sentinel arrives
    for url in iter(url_queue.get, 'STOP'):
        internal_url_queue.put(url)

    # Block until the worker threads have processed every forwarded URL
    internal_url_queue.join()

# Init the queues
url_queue = ProcessQueue()
page_queue = ProcessQueue()

# Init the process
download_worker = Process(target=downloader, args=(url_queue, page_queue))
download_worker.start()

From another module I can add URLs, and when I want to I can stop the process and wait for it to close.

import module

module.url_queue.put('http://foobar1')
module.url_queue.put('http://foobar2')
module.url_queue.put('http://foobar3')
module.url_queue.put('STOP')
module.download_worker.join()

The problem is that when I call urlopen ("response = urllib2.urlopen(request)"), everything stays blocked.

There is no problem if I call the download() function directly, or if I use only threads without a Process.
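For reference, a rough sketch of the thread-only variant that works (an illustration only, reusing download(), downloader(), url_queue and page_queue from above; not the exact code I run):

from threading import Thread

# Same worker logic, but run in a daemon thread inside this process
# instead of a separate multiprocessing.Process.
download_worker = Thread(target=downloader, args=(url_queue, page_queue))
download_worker.setDaemon(True)
download_worker.start()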



1 comment

你怎么这么可爱啊 2024-08-25 18:54:09

The problem here is not urllib2 but the way the multiprocessing module is used. When you use multiprocessing under Windows, the main module must not contain code that runs immediately on import; instead, put that code inside an if __name__=='__main__' block. See the "Safe importing of main module" section of the multiprocessing documentation.

For your code, make the following change in the downloader module:

#....
def start():
    # Defer creating the Process until start() is called explicitly,
    # so importing the module has no side effects.
    global download_worker
    download_worker = Process(target=downloader, args=(url_queue, page_queue))
    download_worker.start()

And in the main module:

import module
if __name__=='__main__':
    module.start()
    module.url_queue.put('http://foobar1')
    #....

Because you did not do this, each time the subprocess started it ran the main code again and launched another process, which caused the hang.
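Putting the pieces together, a minimal sketch of what the main module might look like after the change (it simply combines the queue calls from the question with the guard described above):

import module

if __name__ == '__main__':
    # Runs only in the parent process; when multiprocessing re-imports this
    # module inside the child process on Windows, the block is skipped.
    module.start()
    module.url_queue.put('http://foobar1')
    module.url_queue.put('http://foobar2')
    module.url_queue.put('http://foobar3')
    module.url_queue.put('STOP')        # sentinel that tells downloader() to stop
    module.download_worker.join()       # wait for the worker process to exit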

