Running multiple threads in Python at the same time - is it possible?
I'm writing a little crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously).
I've written a little piece of code that should do that.
import thread
import time  # needed for the time.sleep() calls in the retry paths
from urllib2 import Request, urlopen, URLError, HTTPError

def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout=8)  # fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + " address: " + FetchAddress  # generic exceptions have no .reason
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)
for i in range(1, 50):  # note: this launches 49 threads, not 50
    thread.start_new_thread(getPAGE, url)
From the Apache logs it doesn't seem like the threads are running simultaneously; there's a little gap between requests. It's almost undetectable, but I can see that the threads are not really parallel.
I've read about the GIL. Is there a way to bypass it without calling C/C++ code?
I can't really understand how threading is even possible with the GIL. Does Python basically interpret the next thread as soon as it finishes with the previous one?
Thanks.
Comments (5)
As you point out, the GIL often prevents Python threads from running in parallel.
However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it would typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.
In general, however, multiprocessing is the safer bet when true parallelism is required.
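To make that last suggestion concrete, here is a minimal sketch (my own, not from the answer; the URL and pool size are placeholders) that does the question's fetches with a pool of worker processes, each of which has its own interpreter and its own GIL:

from multiprocessing import Pool
from urllib2 import urlopen

def fetch(address):
    # same job as getPAGE above, reduced to the essentials
    try:
        return urlopen(address, timeout=8).read()
    except Exception:
        return None

if __name__ == '__main__':
    pool = Pool(processes=8)  # 8 OS processes, no shared GIL
    pages = pool.map(fetch, ["http://www.domain.com"] * 50)
    pool.close()
    pool.join()

pool.map blocks until every fetch has finished and returns the pages in order.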
Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.
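As a quick illustration of the blocking-I/O point (my own experiment, with time.sleep standing in for a blocking call): two threads that each sleep for one second finish in about one second total, because each releases the GIL while it is blocked.

import threading
import time

def sleeper():
    time.sleep(1)  # blocks outside the interpreter loop; the GIL is released

start = time.time()
threads = [threading.Thread(target=sleeper) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "elapsed: %.2fs" % (time.time() - start)  # ~1s, not ~2s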
You can use an approach like this to create all threads, have them wait for a condition object, and then have them start fetching the url "simultaneously":
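The code that originally accompanied this answer did not survive the copy; a minimal sketch of the idea, assuming the getPAGE function from the question, might look like this:

import thread
import threading
import time

go = threading.Condition()

def worker(FetchAddress):
    # park on the condition until the main thread fires it
    go.acquire()
    go.wait()
    go.release()
    getPAGE(FetchAddress)  # assumed: the function from the question

for i in range(50):
    thread.start_new_thread(worker, ("http://www.domain.com",))

time.sleep(1)    # crude: give every thread time to reach wait()
go.acquire()
go.notifyAll()   # wake all workers at (nearly) the same moment
go.release()
time.sleep(30)   # keep the main thread alive while the fetches run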
This would get you a bit closer to having all fetches happen at the same time, BUT: on the server side, each incoming connection is still picked up by an accept() call, and for correct behavior that is implemented using a server-global lock to ensure only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation. You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
You can also look at things like the future of PyPy, where we will have software transactional memory (thus doing away with the GIL). This is all just research and intellectual scoffing at the moment, but it could grow into something big.
If you run your code with Jython or IronPython (and maybe PyPy in the future), it will run in parallel.