龙卷风 AsyncHTTPClient.fetch 异常

发布于 2024-12-09 01:31:32 字数 1624 浏览 1 评论 0原文

我正在使用 tornado.httpclient.AsyncHTTPClient.fetch 从列表中获取域。当我以一些大的间隔（例如 500）来获取域时，一切都很好，但是当我将间隔减少到 100 时，下一个异常会时不时地发生：


Traceback (most recent call last):
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 289, in cleanup
    yield
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/stack_context.py", line 183, in wrapped
    callback(*args, **kwargs)
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 384, in _on_chunk_length
    self._on_chunk_data)
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/iostream.py", line 180, in read_bytes
    self._check_closed()
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/iostream.py", line 504, in _check_closed
    raise IOError("Stream is closed")
IOError: Stream is closed

这种行为的原因是什么？代码如下所示：


def fetch_domain(domain):
    http_client = AsyncHTTPClient()
    request = HTTPRequest('http://' + domain,
       user_agent=CRAWLER_USER_AGENT)
    http_client.fetch(request, handle_domain)


class DomainFetcher(object):
    def __init__(self, domains_iterator):
        self.domains = domains_iterator

    def __call__(self):
        try:
            domain = next(self.domains)
        except StopIteration:
            domain_generator.stop()
            ioloop.IOLoop.instance().stop()
        else:
            fetch_domain(domain)

domain_generator = ioloop.PeriodicCallback(DomainFetcher(domains), 500)
domain_generator.start()

原文

I am using tornado.httpclient.AsyncHTTPClient.fetch to fetch domains from list. When I put domains to fetch with some big interval(500 for example) all works good, but when I decrease the inerval to 100, next exception occurs time to time:


Traceback (most recent call last):
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 289, in cleanup
    yield
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/stack_context.py", line 183, in wrapped
    callback(*args, **kwargs)
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 384, in _on_chunk_length
    self._on_chunk_data)
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/iostream.py", line 180, in read_bytes
    self._check_closed()
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/iostream.py", line 504, in _check_closed
    raise IOError("Stream is closed")
IOError: Stream is closed

What can be the reason of this behavior? Code looks like this:


def fetch_domain(domain):
    http_client = AsyncHTTPClient()
    request = HTTPRequest('http://' + domain,
       user_agent=CRAWLER_USER_AGENT)
    http_client.fetch(request, handle_domain)


class DomainFetcher(object):
    def __init__(self, domains_iterator):
        self.domains = domains_iterator

    def __call__(self):
        try:
            domain = next(self.domains)
        except StopIteration:
            domain_generator.stop()
            ioloop.IOLoop.instance().stop()
        else:
            fetch_domain(domain)

domain_generator = ioloop.PeriodicCallback(DomainFetcher(domains), 500)
domain_generator.start()

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

莫相离 2024-12-16 01:31:32

请注意，tornado.ioloop.PeriodicCallback需要一个周期时间以整数毫秒为单位，而HTTPRequest 对象配置有 < code>connect_timeout 和/或 request_timeout 浮点秒 (参见文档）。

“当从点击到响应的延迟小于 100 毫秒时，浏览互联网的用户会感觉响应是“即时”的”(来自维基百科）参见此 ServerFault 问题针对正常延迟 价值观。

IOError: Stream is close 被有效地引发，以通知您连接超时而未完成，或者更准确地说，您在尚未打开的管道上手动调用了回调。这很好，因为延迟大于 1 并不是异常。 100毫秒；如果您希望提取能够可靠地完成，则应该提高该值。

一旦您将超时设置为合理的值，请考虑将您的提取包装在 try/ except 重试循环中，因为这是您可能会在生产中发生的正常异常。请注意设置重试限制！

既然您使用的是异步框架，为什么不让它自己处理异步回调，而不是按固定时间间隔运行所述回调？ Epoll/kqueue 非常高效并且受此框架支持。

import ioloop

def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body
    ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
http_client.fetch("http://www.google.com/", handle_request)
ioloop.IOLoop.instance().start()

^ 从文档中逐字复制。

如果您走这条路，唯一的问题是对请求队列进行编码，以便强制执行最大打开连接数。否则，在进行严重的刮擦时，您可能会遇到竞争状况。

自从我自己接触 Tornado 以来已经有大约一年了，所以如果这个回复中有不准确的地方，请告诉我，我会修改。

note that tornado.ioloop.PeriodicCallback takes a cycle time in integer ms while the HTTPRequest object is configured with a connect_timeout and/or a request_timeout of float seconds (see doc).

"Users browsing the Internet feel that responses are "instant" when delays are less than 100 ms from click to response" (from wikipedia) See this ServerFault question for normal latency values.

IOError: Stream is closed is validly being raised to inform you that your connection timed out without completing, or more accurately, you called the callback manually on a pipe that wasn't open yet. This is good, since it is not abnormal for latency to be > 100ms; if you expect your fetches to complete reliably you should raise this value.

Once you've got your timeout set to something sane, consider wrapping your fetches in a try/except retry loop as this is a normal exception that you can expect to occur in production. Just be careful to set a retry limit!

Since you're using an async framework, why not let it handle the async callback itself instead of running said callback on a fixed interval? Epoll/kqueue are efficient and supported by this framework.

import ioloop

def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body
    ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
http_client.fetch("http://www.google.com/", handle_request)
ioloop.IOLoop.instance().start()

^ Copied verbatim from the doc.

If you go this route, the only gotcha is to code your request queue so that you have a maximum open connections enforced. Otherwise you're likely to end up with a race condition when doing serious scraping.

It's been ~1yr since I touched Tornado myself, so please let me know if there are inaccuracies in this response and I will revise.

回复收藏 0 原文

能否归途做我良人 2024-12-16 01:31:32

看起来你正在写类似网络爬虫的东西。您的问题是直接由超时引起的，但在深层，与龙卷风中的并行模式有关。

当然，tornado 中的 AsyncHTTPClient 可以自动对请求进行排队。实际上，AsyncHTTPClient 会批量发送 10 个请求（默认情况下），并阻塞等待结果，然后发送下一批。批次内的请求是非阻塞的并且并行处理，但批次之间是阻塞的。并且每个请求的回调不会在该请求完成后立即调用，而是在该批请求完成后调用 10 个回调。

回到你的问题，你不需要使用 ioloop.PeriodicCallback 来增量地发出请求，因为龙卷风中的 AsyncHTTPClient 可以自动对请求进行排队。您可以一次性分配所有请求，让 AsyncHTTPClient 来调度请求。

但问题来了，等待队列中的请求仍然消耗超时时间！因为请求在批次之间被阻塞。后面的请求只是在这里阻塞，然后批量发送，而不是将它们放入特殊的就绪队列中，并在响应到达后发送新的请求。

因此，如果调度的请求较多，默认的超时设置为 20 秒就太短了。如果您只是做一个演示，您可以直接将超时设置为float('inf')。如果做一些严肃的事情，你必须使用 try/ except 重试循环。

您可以从 tornado/httpclient.py< 找到如何设置超时/code>，在这里引用。

connect_timeout：初始连接超时（以秒为单位）
request_timeout：整个请求的超时时间（以秒为单位）

最后，我编写了一个简单的程序，使用 AsyncHTTPClient 从 ZJU 在线评审系统获取数千个页面。你可以尝试一下，然后重写你的爬虫。在我的网络上，它可以在 2 分钟内获取 2800 个页面。非常好的结果，比串行读取快 10 倍（完全匹配并行大小）。

#!/usr/bin/env python
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='

start = 1001
end = 3800
count = end - start
done = 0

client = AsyncHTTPClient()

def onResponse(response):
    if response.error:
        print('Error: %s' % response.error)
    else:
        global done
        done += 1
        #It is comment out here, you could uncomment it and watch something interest, that len(client.queue) is reduce 10 by 10.
        #print('Queue length %s, Active client count %s, Max active clients limit %s' % (len(client.queue), len(client.active), client.max_clients))
        print('Received %s, Content length %s, Done %s' % (response.effective_url[-4:], len(response.body), done))
        if(done == count):
            IOLoop.instance().stop()

for i in range (start, end):
    request = HTTPRequest(baseUrl + str(i), connect_timeout=float('inf'), request_timeout=float('inf'))
    client.fetch(request, onResponse)
    print('Generated %s' % i)

IOLoop.instance().start()

额外：

如果您有大量页面需要获取，并且您是那种追求最佳性能的人，那么您可以看看 Twisted。我用 Twisted 编写了一个相同的程序，并将其粘贴到我的 Gist 上。其结果非常棒：40 秒内获取 2800 个页面。

It looks like you are writing something like web crawler. You problem is cause by timeout directly, but in deep, related to the parallel pattern in tornado.

Of course, AsyncHTTPClient in tornado could automatically queue the requests. Actually, AsyncHTTPClient will send 10 requests(by default) in batch, and block to wait for their result, then send the next batch. The requests within batch is non-block and process in parallel, but it is block between batches. And the callback for each request is not called immediately after that request has done, but after that batch of requests has done and then call 10 callbacks.

Back to your problem, you needn't to use ioloop.PeriodicCallback to incrementally make the requests, since AsyncHTTPClient in tornado could automatically queue the requests. You could assign all of the requests in one time, let the AsyncHTTPClient to schedule the requests.

But here comes the problem that requests in the waiting queue still consume the timeout time! Because requests are block between batches. Later requests are simply block here, and send batch by batch, rather than put them in a special ready queue and send a new requests once a response arrived.

Therefore, the default timeout set to 20s is too-short if many requests scheduled. If you are just making a demo, you could directly set the timeout to float('inf'). If making something serious, you have to use try/except retry loop.

You could find how to set timeout from tornado/httpclient.py, quote here.

connect_timeout: Timeout for initial connection in seconds
request_timeout: Timeout for entire request in seconds

In the end, I write a simple program that use AsyncHTTPClient to fetch thousands of pages from ZJU Online Judgement System. You could have a try on this, and then rewrite to your crawler. On my network, it could fetch 2800 pages in 2 minutes. Very good results, 10 times(exactly match the parallel size) faster than serial fetch.

#!/usr/bin/env python
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='

start = 1001
end = 3800
count = end - start
done = 0

client = AsyncHTTPClient()

def onResponse(response):
    if response.error:
        print('Error: %s' % response.error)
    else:
        global done
        done += 1
        #It is comment out here, you could uncomment it and watch something interest, that len(client.queue) is reduce 10 by 10.
        #print('Queue length %s, Active client count %s, Max active clients limit %s' % (len(client.queue), len(client.active), client.max_clients))
        print('Received %s, Content length %s, Done %s' % (response.effective_url[-4:], len(response.body), done))
        if(done == count):
            IOLoop.instance().stop()

for i in range (start, end):
    request = HTTPRequest(baseUrl + str(i), connect_timeout=float('inf'), request_timeout=float('inf'))
    client.fetch(request, onResponse)
    print('Generated %s' % i)

IOLoop.instance().start()

Extra:

If you have plenty of pages to fetch, and you are the type of people that chase for best performance, you could have a look on Twisted. I write a same program with Twisted and paste it on my Gist. Its result is awesome: fetch 2800 pages in 40 seconds.

回复收藏 0 原文

~没有更多了~