Performance difference between urllib2 and asyncore

Posted 2024-12-09 03:08:47 · 2260 words · 0 views · 0 comments

I have some questions about the performance of this simple Python script:

import sys, urllib2, asyncore, socket, urlparse
from timeit import timeit

class HTTPClient(asyncore.dispatcher):
    def __init__(self, host, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect( (host, 80) )
        self.buffer = 'GET %s HTTP/1.0\r\n\r\n' % path
        self.data = ''
    def handle_connect(self):
        pass
    def handle_close(self):
        self.close()
    def handle_read(self):
        self.data += self.recv(8192)
    def writable(self):
        return (len(self.buffer) > 0)
    def handle_write(self):
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

url = 'http://pacnet.karbownicki.com/api/categories/'

components = urlparse.urlparse(url)
host = components.hostname or ''
path = components.path

def fn1():
    try:
        response = urllib2.urlopen(url)
        try:
            return response.read()
        finally:
            response.close()
    except:
        pass

def fn2():
    client = HTTPClient(host, path)
    asyncore.loop()
    return client.data

if sys.argv[1:]:
    print 'fn1:', len(fn1())
    print 'fn2:', len(fn2())

time = timeit('fn1()', 'from __main__ import fn1', number=1)
print 'fn1: %.8f sec/pass' % (time)

time = timeit('fn2()', 'from __main__ import fn2', number=1)
print 'fn2: %.8f sec/pass' % (time)

Here's the output I'm getting on Linux:

$ python2 test_dl.py
fn1: 5.36162281 sec/pass
fn2: 0.27681994 sec/pass

$ python2 test_dl.py count
fn1: 11781
fn2: 11965
fn1: 0.30849886 sec/pass
fn2: 0.30597305 sec/pass

Why is urllib2 so much slower than asyncore in the first run?

And why does the discrepancy seem to disappear on the second run?

EDIT: I found a hackish solution to this problem in the question "Force python mechanize/urllib2 to only use A requests?"

The five-second delay disappears if I monkey-patch the socket module as follows:

_getaddrinfo = socket.getaddrinfo

def getaddrinfo(host, port, family=0, socktype=0, proto=0, flags=0):
    return _getaddrinfo(host, port, socket.AF_INET, socktype, proto, flags)

socket.getaddrinfo = getaddrinfo
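
The patch above replaces `socket.getaddrinfo` for the whole process. Below is a minimal sketch of a scoped variant, assuming Python 3 (where the fourth parameter of `getaddrinfo` is named `type`). The name `force_ipv4` is hypothetical: it is not part of the original post or of the standard library. It applies the same IPv4-only restriction, but restores the original resolver when the `with` block exits.

```python
import contextlib
import socket


@contextlib.contextmanager
def force_ipv4():
    """Temporarily restrict socket.getaddrinfo() to AF_INET (IPv4) lookups.

    Hypothetical helper for illustration; same idea as the monkey patch
    in the question, but undone automatically on exit.
    """
    original = socket.getaddrinfo

    def ipv4_only(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore the requested family and always ask for IPv4 addresses.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = ipv4_only
    try:
        yield
    finally:
        # Restore the original resolver even if the body raised.
        socket.getaddrinfo = original
```

Under the same Python 3 assumption, where `urllib2.urlopen` became `urllib.request.urlopen`, the request would be wrapped as `with force_ipv4(): urllib.request.urlopen(url)`. This is still a workaround on the client side; it does not fix the underlying resolver behaviour.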

Comments (3)

风启觞 2024-12-16 03:08:47

This is probably down to your OS: if it caches DNS lookups, the first request has to be answered by a DNS server, while subsequent requests for the same name are served from the cache.

EDIT: As the comments show, it's probably not a DNS problem. I still maintain that it's the OS and not Python. I tested the code on both Windows and FreeBSD and didn't see this kind of difference; both functions take about the same time.

That is exactly how it should be: there shouldn't be a significant difference for a single request. I/O and network latency probably make up about 90% of these timings.

梦醒灬来后我 2024-12-16 03:08:47

Did you try doing the reverse, i.e. first via asyncore and then via urllib?

Case 1: We first try urllib and then asyncore.

fn1: 1.48460957 sec/pass
fn2: 0.91280798 sec/pass

Observation: asyncore completed the same operation 0.57180159 seconds faster.

Let's reverse the order.

Case 2: We now try asyncore first and then urllib.

fn2: 1.27898671 sec/pass
fn1: 0.95816954 sec/pass

Observation: This time urllib took 0.32081717 seconds less than asyncore.

Two conclusions here:

  1. urllib2 would always take more time than asyncore, because urllib2 leaves the socket address family unspecified, while asyncore lets the user choose it; in this case we have set it to AF_INET (IPv4).

  2. If two connections are made to the same server, whether through asyncore or urllib, the second one performs better. This is because of the default cache behavior. To understand this better, see: https://stackoverflow.com/a/6928657/1060337
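
The first conclusion can be inspected directly. The sketch below assumes Python 3, and `families_returned` is a hypothetical helper name, not something from the answer. It lists which address families `getaddrinfo()` hands back when the family is left unspecified (`socket.AF_UNSPEC`, which equals 0, the default used when no family is passed) compared with an explicit `socket.AF_INET`.

```python
import socket


def families_returned(host, port=80, family=socket.AF_UNSPEC):
    """Return the sorted names of address families getaddrinfo() yields.

    Hypothetical helper for illustration only.
    """
    infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr).
    return sorted({socket.AddressFamily(info[0]).name for info in infos})
```

Calling `families_returned(host)` and `families_returned(host, family=socket.AF_INET)` for the host in the question would show whether the unrestricted lookup also asks for, and returns, IPv6 addresses on a given machine. The result depends on the local resolver configuration, so no particular output is implied here.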

References:

Want a general overview of how sockets work?

http://www.cs.odu.edu/~mweigle/courses/cs455-f06/lectures/2-1-ClientServer.pdf

Want to write your own socket code in Python?

http://www.ibm.com/developerworks/linux/tutorials/l-pysocks/index.html

To know about socket families or general terminology check this wiki:

http://en.wikipedia.org/wiki/Berkeley_sockets

Note: This answer was last updated on April 05, 2012, 2AM IST

記柔刀 2024-12-16 03:08:47

Finally found a good explanation of what causes this problem, and why:

This is a problem with the DNS resolver.

This problem will occur for any DNS request which the DNS resolver does not support. The proper solution is to fix the DNS resolver.

What happens:

  • Program is IPv6 enabled.
  • When it looks up a hostname, getaddrinfo() first asks for an AAAA record.
  • The DNS resolver sees the request for the AAAA record and goes "uhmmm, I dunno what it is, let's throw it away".
  • The DNS client (getaddrinfo() in libc) waits for a response and has to time out, as there is no response. (THIS IS THE DELAY)
  • No records have been received yet, so getaddrinfo() falls back to requesting the A record. This works.
  • Program gets the A records and uses those.
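
One way to check whether this sequence is what causes the delay on a given machine is to time the name lookup on its own, once with the family unspecified and once restricted to IPv4. The following is a rough diagnostic sketch, assuming Python 3; `time_lookup` and `compare_families` are hypothetical names introduced here, not part of the original answer.

```python
import socket
import time


def time_lookup(host, family, port=80):
    """Return (seconds, number_of_results) for one getaddrinfo() call."""
    start = time.perf_counter()
    try:
        results = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed (for example, no record of the requested type).
        results = []
    return time.perf_counter() - start, len(results)


def compare_families(host):
    """Time the same lookup with and without the IPv4-only restriction."""
    report = {}
    for label, family in (("AF_UNSPEC", socket.AF_UNSPEC),
                          ("AF_INET", socket.AF_INET)):
        report[label] = time_lookup(host, family)
    return report
```

If the `AF_UNSPEC` lookup takes several seconds while the `AF_INET` lookup returns almost immediately, that is consistent with the AAAA timeout described above. Because lookups may be cached between calls, the order of the two measurements can affect the numbers, so it is worth repeating the comparison in both orders.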

This does NOT only affect IPv6 (AAAA) records; it also affects any other DNS record type that the resolver does not support.

For me, the solution was to install dnsmasq (but I suppose any other DNS resolver would do).
