Running multiple threads in Python simultaneously - is it possible?

Published 2024-12-04 07:04:06

I'm writing a little crawler that should fetch a URL multiple times, I want all of the threads to run at the same time (simultaneously).

I've written a little piece of code that should do that.

import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError


def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout = 8) #fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)

for i in range(1,50):
    thread.start_new_thread(getPAGE, url)

From the Apache logs it doesn't seem like the threads are running simultaneously; there's a little gap between requests. It's almost undetectable, but I can see that the threads are not really parallel.

I've read about the GIL. Is there a way to bypass it without calling C/C++ code?
I can't really understand how threading is possible with the GIL. Does Python basically interpret the next thread as soon as it finishes with the previous one?

Thanks.

Comments (5)

偏闹i 2024-12-11 07:04:06

As you point out, the GIL often prevents Python threads from running in parallel.

However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it would typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.

In general, however, multiprocessing is the safer bet when true parallelism is required.
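As a minimal sketch of that last point (the worker function and URLs below are made up for illustration, not from the original post), `multiprocessing` sidesteps the GIL by running workers in separate processes, each with its own interpreter:

```python
# Hypothetical sketch: distribute work across processes with a Pool.
# fetch_length is a stand-in worker; a real crawler would perform the
# network fetch here (e.g. urlopen(url).read()) instead.
from multiprocessing import Pool


def fetch_length(url):
    # Placeholder for real I/O: just measure the URL string.
    return len(url)


if __name__ == "__main__":
    urls = ["http://www.domain.com/%d" % i for i in range(4)]
    # Each task runs in its own process, so the GIL of one process
    # cannot block progress in another.
    with Pool(processes=4) as pool:
        results = pool.map(fetch_length, urls)
    print(results)
```

Note that for a crawler, which is I/O-bound, plain threads usually suffice; processes pay off when the work is CPU-bound.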

醉态萌生 2024-12-11 07:04:06

I've read about the GIL. Is there a way to bypass it without calling C/C++ code?

Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.
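The blocking-I/O point can be observed directly. In this hedged sketch, `time.sleep` stands in for any blocking call that releases the GIL while waiting; two threads sleeping 0.5 s each finish in roughly 0.5 s of wall-clock time, not 1.0 s, showing that they really overlap:

```python
# Demonstration: blocking calls release the GIL, so threads overlap.
import threading
import time


def blocker():
    time.sleep(0.5)  # the GIL is released for the duration of the wait


start = time.time()
threads = [threading.Thread(target=blocker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("elapsed: %.2f s" % elapsed)  # close to 0.5 s, not 1.0 s
```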

寂寞笑我太脆弱 2024-12-11 07:04:06

You can use an approach like this to create all threads, have them wait for a condition object, and then have them start fetching the url "simultaneously":

#!/usr/bin/env python
import threading
import datetime
import urllib2

allgo = threading.Condition()

class ThreadClass(threading.Thread):
    def run(self):
        allgo.acquire()
        allgo.wait()
        allgo.release()
        print "%s at %s\n" % (self.getName(), datetime.datetime.now())
        response = urllib2.urlopen("http://www.ibm.com")

for i in range(50):
    t = ThreadClass()
    t.start()

allgo.acquire()
allgo.notify_all()
allgo.release()

This would get you a bit closer to having all fetches happen at the same time, BUT:

  • The network packets leaving your computer will pass along the ethernet wire in sequence, not at the same time,
  • Even if you have 16+ cores on your machine, some router, bridge, modem or other equipment in between your machine and the web host is likely to have fewer cores, and may serialize your requests,
  • The web server you're fetching stuff from will use an accept() call to respond to your request. For correct behavior, that is implemented using a server-global lock to ensure only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation.

You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
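For reference, the Condition pattern above can also be expressed with `threading.Barrier`, which was added later (Python 3.2) precisely for this "everyone starts together" rendezvous. This is a hedged sketch, with a list append standing in for the real URL fetch:

```python
# Alternative sketch using threading.Barrier: each worker blocks at
# barrier.wait() until all N parties have arrived, then all are
# released at once.
import threading

N = 5
barrier = threading.Barrier(N)
results = []
lock = threading.Lock()


def worker(i):
    barrier.wait()            # rendezvous: released with the other N-1
    with lock:
        results.append(i)     # stand-in for the real URL fetch


threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Unlike the Condition version, no separate notifier is needed: the last thread to arrive at the barrier releases everyone.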

不离久伴 2024-12-11 07:04:06

You can also look at things like the future of PyPy, where we will have software transactional memory (thus doing away with the GIL). This is all just research and intellectual exercise at the moment, but it could grow into something big.

不必你懂 2024-12-11 07:04:06

If you run your code with Jython or IronPython (and maybe PyPy in the future), it will run in parallel.
