How do I handle a urllib timeout in Python 3?

Posted on 2024-12-25 08:26:22

First off, my problem is quite similar to this one. I would like a timeout of urllib.urlopen() to generate an exception that I can handle.

Doesn't this fall under URLError?

import logging
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error(
        'Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')

The error message:

resp = urllib.request.urlopen(req, timeout=10).read().decode('utf-8')
File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.2/urllib/request.py", line 369, in open
response = self._open(req, data)
File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
'_open', req)
File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
result = func(*args)
File "/usr/lib/python3.2/urllib/request.py", line 1156, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.2/urllib/request.py", line 1141, in do_open
r = h.getresponse()
File "/usr/lib/python3.2/http/client.py", line 1046, in getresponse
response.begin()
File "/usr/lib/python3.2/http/client.py", line 346, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.2/http/client.py", line 308, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.2/socket.py", line 276, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out

There was a major change in Python 3 when the urllib and urllib2 modules were reorganised into urllib. Is it possible that a change made then causes this?

Comments (3)

甜心小果奶 2025-01-01 08:26:22

Catch the different exceptions with explicit clauses, and check the reason for the exception with URLError (thank you Régis B. and Daniel Andrzejewski)

import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

# `url` and `name` are assumed to be defined elsewhere.
try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('HTTP Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
except URLError as error:
    if isinstance(error.reason, timeout):
        logging.error('Timeout Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
    else:
        logging.error('URL Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')

NB: for readers of the more recent comments, the original post referenced Python 3.2, where you needed to catch timeout errors explicitly with socket.timeout. For example:



    # Warning - python 3.2 code
    from socket import timeout
    
    try:
        response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
    except timeout:
        logging.error('socket timed out - URL %s', url)
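
As a side note: on Python 3.10 and later, socket.timeout is an alias of the built-in TimeoutError, so a variant along the lines of the following minimal sketch should also work (assuming url and the logging setup from the snippets above):

import logging
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('HTTP Error: %s\nURL: %s', error, url)
except TimeoutError:
    # Timeouts raised while reading the response may surface directly.
    logging.error('socket timed out - URL %s', url)
except URLError as error:
    if isinstance(error.reason, TimeoutError):
        # Timeouts raised while connecting/sending may be wrapped in URLError.
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('URL Error: %s\nURL: %s', error, url)
else:
    logging.info('Access successful.')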

凑诗 2025-01-01 08:26:22

The previous answer does not correctly intercept timeout errors. Timeout errors are raised as URLError, so if we want to specifically catch them, we need to write:

import logging
import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('Data not retrieved because %s\nURL: %s', error, url)
except URLError as error:
    if isinstance(error.reason, socket.timeout):
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('some other error happened')
else:
    logging.info('Access successful.')

Note that a ValueError can be raised independently, e.g. if the URL is invalid. Like HTTPError, it is not associated with a timeout.
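
For completeness, here is a minimal sketch of handling that case too, assuming the same url and logging setup as above (the extra ValueError clause is an illustrative addition):

import logging
import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except ValueError as error:
    # Raised before any request is made, e.g. urlopen('not-a-url').
    logging.error('Invalid URL %s: %s', url, error)
except HTTPError as error:
    logging.error('Data not retrieved because %s\nURL: %s', error, url)
except URLError as error:
    if isinstance(error.reason, socket.timeout):
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('some other error happened')
else:
    logging.info('Access successful.')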

-黛色若梦 2025-01-01 08:26:22

What is a "timeout"? Holistically I think it means "a situation where the server didn't respond in time, typically because of high load, and it's worth retrying again."

HTTP status 504 "gateway timeout" would be a timeout under this definition. It's delivered via HTTPError.

HTTP status 429 "too many requests" would also be a timeout under that definition. It too is delivered via HTTPError.

Otherwise, what do we mean by a timeout? Do we include timeouts in resolving the domain name via the DNS resolver? timeouts when trying to send data? timeouts when waiting for the data to come back?

I don't know how to audit the source code of urllib to be sure that every possible way that I might consider a timeout, is being raised in a way that I'd catch. In a language without checked exceptions, I don't know how. I have a hunch that maybe connect-to-dns errors might be coming back as socket.timeout, and connect-to-remote-server errors might be coming back as URLError(socket.timeout)? It's just a guess that might explain earlier observations.

So I fell back to some really defensive coding. (1) I'm handling some HTTP status codes that are indicative of timeouts. (2) There are reports that some timeouts come via socket.timeout exceptions, and some via URLError(socket.timeout) exceptions, so I'm catching both. (3) And just in case, I threw in HTTPError(socket.timeout) as well.

import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

# `url` and `cache` (a local file path) are assumed to be defined by the
# surrounding code; the `return` shows this loop lives inside a fetch function.
while True:
    reason: Optional[str] = None
    try:
        with urllib.request.urlopen(url) as response:
            content = response.read()
            with open(cache, "wb") as file:
                file.write(content)
            return content
    except urllib.error.HTTPError as e:
        if e.code == 429 or e.code == 504:  # 429=too many requests, 504=gateway timeout
            reason = f'{e.code} {str(e.reason)}'
        elif isinstance(e.reason, socket.timeout):
            reason = f'HTTPError socket.timeout {e.reason} - {e}'
        else:
            raise
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            reason = f'URLError socket.timeout {e.reason} - {e}'
        else:
            raise
    except socket.timeout as e:
        reason = f'socket.timeout {e}'
    except:
        raise
    netloc = urllib.parse.urlsplit(url).netloc  # e.g. nominatim.openstreetmap.org
    print(f'*** {netloc} {reason}; will retry', file=sys.stderr)
    time.sleep(5)
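
One design note on the loop above: it retries forever. A bounded variant is sketched below in a hypothetical fetch_with_retries helper (the function name, the max_attempts parameter, and the blanket retry of any URLError are illustrative simplifications, not part of the answer above):

import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request


def fetch_with_retries(url: str, cache: str, max_attempts: int = 5) -> bytes:
    """Hypothetical helper: bounded retries instead of an endless while-loop."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
            with open(cache, "wb") as file:
                file.write(content)
            return content
        except (urllib.error.URLError, socket.timeout) as e:
            # For brevity this treats any URLError (including HTTPError) or
            # socket timeout as retryable; the answer above is more selective.
            netloc = urllib.parse.urlsplit(url).netloc
            print(f'*** {netloc} attempt {attempt}/{max_attempts} failed: {e}; will retry',
                  file=sys.stderr)
            time.sleep(5 * attempt)  # simple linear backoff
    raise TimeoutError(f'giving up on {url} after {max_attempts} attempts')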