Python process blocked by urllib2
I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.
import urllib2, multiprocessing
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process


def download(url):
    """Download a page from an url.
    url [str]: url to get.
    return [unicode]: page downloaded.
    """
    if settings.DEBUG:
        print u'Downloading %s' % url
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    encoding = response.headers['content-type'].split('charset=')[-1]
    content = unicode(response.read(), encoding)
    return content


def downloader(url_queue, page_queue):
    def _downloader(url_queue, page_queue):
        while True:
            try:
                url = url_queue.get()
                page_queue.put_nowait({'url': url, 'page': download(url)})
            except Exception, err:
                print u'Error downloading %s' % url
                raise err
            finally:
                url_queue.task_done()

    ## Init internal workers
    internal_url_queue = Queue()
    internal_page_queue = Queue()
    for num in range(multiprocessing.cpu_count()):
        worker = Thread(target=_downloader, args=(internal_url_queue, internal_page_queue))
        worker.setDaemon(True)
        worker.start()

    # Loop waiting closing
    for url in iter(url_queue.get, 'STOP'):
        internal_url_queue.put(url)

    # Wait for closing
    internal_url_queue.join()


# Init the queues
url_queue = ProcessQueue()
page_queue = ProcessQueue()

# Init the process
download_worker = Process(target=downloader, args=(url_queue, page_queue))
download_worker.start()
From another module I can add URLs, and when I want to, I can stop the process and wait for it to close.
import module
module.url_queue.put('http://foobar1')
module.url_queue.put('http://foobar2')
module.url_queue.put('http://foobar3')
module.url_queue.put('STOP')
downloader.download_worker.join()
The problem is that when I use urlopen ("response = urllib2.urlopen(request)"), everything remains blocked.
There is no problem if I call the download() function directly, or if I use only threads without a Process.
1 Answer
The issue here is not urllib2, but the use of the multiprocessing module. When using the multiprocessing module under Windows, you must not have code that runs immediately when your module is imported; instead, put things in the main module inside an
if __name__=='__main__'
block. See the section "Safe importing of main module" in the multiprocessing documentation. For your code, make the following change in the downloader module:
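A minimal sketch of that change, assuming the rest of the question's code stays as it is; the start() helper is not in the original and is only assumed here, so that nothing spawns a process as a side effect of importing the module:

    # downloader module: keep the queues at module level, but move the
    # process creation into a helper so it no longer runs at import time.
    url_queue = ProcessQueue()
    page_queue = ProcessQueue()

    def start():
        download_worker = Process(target=downloader, args=(url_queue, page_queue))
        download_worker.start()
        return download_worker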
And in the main module:
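A matching sketch, again using the assumed start() helper, with everything that must run only in the parent process placed under the guard:

    # main module: the guard keeps this code from running again when a
    # spawned child process (as on Windows) re-imports the module.
    import module

    if __name__ == '__main__':
        download_worker = module.start()
        module.url_queue.put('http://foobar1')
        module.url_queue.put('http://foobar2')
        module.url_queue.put('http://foobar3')
        module.url_queue.put('STOP')
        download_worker.join()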
Because you didn't do this, each time the subprocess was started it would run the main code again and start another process, causing the hang.