Running multiple threads in Python simultaneously - is it possible?

Published 2024-12-04 07:04:06

I'm writing a little crawler that should fetch a URL multiple times, I want all of the threads to run at the same time (simultaneously).

I've written a little piece of code that should do that.

import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError


def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout = 8) #fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)

for i in range(1,50):
    thread.start_new_thread(getPAGE, url)

From the Apache logs it doesn't seem like the threads are running simultaneously; there's a little gap between requests. It's almost undetectable, but I can see that the threads are not really parallel.

I've read about the GIL. Is there a way to bypass it without calling C/C++ code?
I can't really understand how threading is possible with the GIL. Does Python basically interpret the next thread as soon as it finishes with the previous one?

Thanks.

Comments (5)

偏闹i 2024-12-11 07:04:06

As you point out, the GIL often prevents Python threads from running in parallel.

However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it would typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.

In general, however, multiprocessing is the safer bet when true parallelism is required.
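As a minimal sketch of that last point (the worker function and URLs below are made up for illustration, not from the original post), `multiprocessing` sidesteps the GIL by running workers in separate processes, each with its own interpreter:

```python
# Hypothetical sketch: distribute work across processes with a Pool.
# fetch_length is a stand-in worker; a real crawler would perform the
# network fetch here (e.g. urlopen(url).read()) instead.
from multiprocessing import Pool


def fetch_length(url):
    # Placeholder for real I/O: just measure the URL string.
    return len(url)


if __name__ == "__main__":
    urls = ["http://www.domain.com/%d" % i for i in range(4)]
    # Each task runs in its own process, so the GIL of one process
    # cannot block progress in another.
    with Pool(processes=4) as pool:
        results = pool.map(fetch_length, urls)
    print(results)
```

Note that for a crawler, which is I/O-bound, plain threads usually suffice; processes pay off when the work is CPU-bound.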

醉态萌生 2024-12-11 07:04:06

I've read about the GIL. Is there a way to bypass it without calling C/C++ code?

Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.
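The blocking-I/O point can be observed directly. In this hedged sketch, `time.sleep` stands in for any blocking call that releases the GIL while waiting; two threads sleeping 0.5 s each finish in roughly 0.5 s of wall-clock time, not 1.0 s, showing that they really overlap:

```python
# Demonstration: blocking calls release the GIL, so threads overlap.
import threading
import time


def blocker():
    time.sleep(0.5)  # the GIL is released for the duration of the wait


start = time.time()
threads = [threading.Thread(target=blocker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("elapsed: %.2f s" % elapsed)  # close to 0.5 s, not 1.0 s
```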

寂寞笑我太脆弱 2024-12-11 07:04:06

You can use an approach like this to create all threads, have them wait for a condition object, and then have them start fetching the url "simultaneously":

#!/usr/bin/env python
import threading
import datetime
import urllib2

allgo = threading.Condition()

class ThreadClass(threading.Thread):
    def run(self):
        allgo.acquire()
        allgo.wait()
        allgo.release()
        print "%s at %s\n" % (self.getName(), datetime.datetime.now())
        response = urllib2.urlopen("http://www.ibm.com")

for i in range(50):
    t = ThreadClass()
    t.start()

allgo.acquire()
allgo.notify_all()
allgo.release()

This would get you a bit closer to having all fetches happen at the same time, BUT:

  • The network packets leaving your computer will pass along the ethernet wire in sequence, not at the same time,
  • Even if you have 16+ cores on your machine, some router, bridge, modem or other equipment in between your machine and the web host is likely to have fewer cores, and may serialize your requests,
  • The web server you're fetching stuff from will use an accept() call to respond to your request. For correct behavior, that is implemented using a server-global lock to ensure only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation.

You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
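For reference, the Condition pattern above can also be expressed with `threading.Barrier`, which was added later (Python 3.2) precisely for this "everyone starts together" rendezvous. This is a hedged sketch, with a list append standing in for the real URL fetch:

```python
# Alternative sketch using threading.Barrier: each worker blocks at
# barrier.wait() until all N parties have arrived, then all are
# released at once.
import threading

N = 5
barrier = threading.Barrier(N)
results = []
lock = threading.Lock()


def worker(i):
    barrier.wait()            # rendezvous: released with the other N-1
    with lock:
        results.append(i)     # stand-in for the real URL fetch


threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Unlike the Condition version, no separate notifier is needed: the last thread to arrive at the barrier releases everyone.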

不离久伴 2024-12-11 07:04:06

You can also look at things like the future of PyPy, where we will have software transactional memory (thus doing away with the GIL). This is all just research and intellectual exercise at the moment, but it could grow into something big.

不必你懂 2024-12-11 07:04:06

If you run your code with Jython or IronPython (and maybe PyPy in the future), it will run in parallel.
