urllib2 try/except and 404



I'm trying to go through a series of numbered data pages using urllib2. What I want to do is use a try statement, but I know little about it. From reading up a bit, it seems to be based on catching exceptions by specific names, e.g. IOError. I don't know which error I'm looking for, which is part of the problem.

I've written / pasted from 'urllib2 the missing manual' my urllib2 page fetching routine thus:

import os, sys
import urllib2, cookielib

COOKIEFILE = 'cookies.lwp'    # placeholder path for the saved cookie jar

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()

    txheaders =  {'User-agent' : useragent}

    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
        print "previous cookie loaded..."
    else:
        print "no ospath to cookfile"

    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    try:
        req = Request(url, None, txheaders)
        # create a request object; headers are the third argument
        # (the second is POST data, so don't pass the user agent there)

        handle = urlopen(req)
        # and open it to return a handle on the url

    except IOError, e:
        print 'Failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        return False

    else:
        print
        if cj is None:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        else:
        print 'These are the cookies we have received so far :'
        for index, cookie in enumerate(cj):
            print index, '  :  ', cookie
        cj.save(COOKIEFILE)           # save the cookies once, after the loop

        page = handle.read()
        return page

def fetch_series():
    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except [something]:
        print "failed to get page"
        sys.exit()

The bottom function is just an example to show what I mean; can anyone tell me what I should be putting there? I made the page-fetching function return False when it gets a 404. Is this correct? So why doesn't except False: work? Thanks for any help you can give.
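To show what I mean more concretely, the loop I'm really after looks something like this sketch. Since except False: isn't valid (except clauses match exception classes, not values), this version just checks the return value instead; the page count and URL pattern are made up:

```python
def fetch_series(fetch_page, useragent):
    # Sketch of the intended loop: fetch numbered pages until one fails.
    # Assumes fetch_page returns the page text, or False on failure,
    # as in the function above. The page count (99) is made up.
    pages = []
    for n in range(1, 100):
        url = "www.example.com/%02d.html" % n
        page = fetch_page(url, useragent)
        if page is False:        # check the return value; `except False:`
            break                # is not valid, except matches classes
        pages.append(page)
    return pages
```

With fetch_page returning False on a failed fetch, the loop stops at the first missing page.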

OK, as per the advice here I've tried:

except urlib2.URLError, e:

except URLError, e:

except URLError:

except urllib2.IOError, e:

except IOError, e:

except IOError:

except urllib2.HTTPError, e:

except urllib2.HTTPError:

except HTTPError:

None of them work.


玩套路吗 2024-12-25 01:53:41


You should catch urllib2.HTTPError if you want to detect a 404:

try:
    req = urllib2.Request(url, headers={'User-agent': useragent})
    # create a request object (pass the user agent as a header,
    # not as the data argument)

    handle = urllib2.urlopen(req)
    # and open it to return a handle on the url
except urllib2.HTTPError, e:
    print 'We failed with error code - %s.' % e.code

    if e.code == 404:
        # do stuff..
        pass
    else:
        # other stuff...
        pass

    return False
else:
    # ...
    pass

To catch it in fetch_series():

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    try:
        urlopen()
        #...
    except IOError, e:
        # ...
        pass
    else:
        #...
        pass

def fetch_series():
    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except urllib2.HTTPError, e:
        print "failed to get page"

http://docs.python.org/library/urllib2.html:

exception urllib2.HTTPError
Though being an exception (a subclass of URLError), an HTTPError can
also function as a non-exceptional file-like return value (the same
thing that urlopen() returns). This is useful when handling exotic
HTTP errors, such as requests for authentication.

code
An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found
in BaseHTTPServer.BaseHTTPRequestHandler.responses.
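To make the hierarchy the docs describe concrete, here is a minimal, self-contained sketch. The classes below are simplified stand-ins, not the real urllib2 ones; in urllib2, HTTPError subclasses URLError, which subclasses IOError, which is why a plain except IOError also swallows a 404:

```python
# Simplified stand-ins mirroring urllib2's exception hierarchy
# (HTTPError -> URLError -> IOError); NOT the real urllib2 classes.
class URLError(IOError):
    def __init__(self, reason):
        IOError.__init__(self, str(reason))
        self.reason = reason

class HTTPError(URLError):
    def __init__(self, url, code, msg):
        URLError.__init__(self, msg)
        self.url = url
        self.code = code

def open_url(url, status):
    # pretend fetch: raise for any HTTP error status
    if status >= 400:
        raise HTTPError(url, status, 'HTTP Error %d' % status)
    return 'page body'

try:
    open_url('www.example.com/01.html', 404)
except HTTPError as e:
    caught = e.code    # 404
```

Because of the subclass chain, catching IOError here would also have caught the 404, which is exactly what happens inside the asker's fetch_page.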

懒的傷心 2024-12-25 01:53:41


I recommend you check out the wonderful requests module.

With it you could achieve the functionality you are asking about like so:

import requests
from requests.exceptions import HTTPError

try:
    r = requests.get('http://httpbin.org/status/200')
    r.raise_for_status()
except HTTPError:
    print 'Could not download page'
else:
    print r.url, 'downloaded successfully'

try:
    r = requests.get('http://httpbin.org/status/404')
    r.raise_for_status()
except HTTPError:
    print 'Could not download', r.url
else:
    print r.url, 'downloaded successfully'
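Under the hood, raise_for_status() essentially just checks the status code and raises for 4xx/5xx responses. A rough sketch of that idea (this is not requests' actual implementation, and the stand-in HTTPError below is not the real requests.exceptions.HTTPError):

```python
class HTTPError(IOError):
    # stand-in for requests.exceptions.HTTPError
    pass

def raise_for_status(status_code, url):
    # sketch of the idea behind Response.raise_for_status():
    # raise for client (4xx) and server (5xx) errors, otherwise do nothing
    if 400 <= status_code < 500:
        raise HTTPError('%s Client Error for url: %s' % (status_code, url))
    if 500 <= status_code < 600:
        raise HTTPError('%s Server Error for url: %s' % (status_code, url))

raise_for_status(200, 'http://httpbin.org/status/200')   # no exception
```

This is why the 404 case above lands in the except HTTPError branch while the 200 case falls through to the else.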
意中人 2024-12-25 01:53:41


Interactive poking:

To find out about the nature and possible content of such exceptions in Python, it's best to simply try the key calls interactively:

>>> f = urllib2.urlopen('http://httpbin.org/status/404')
Traceback (most recent call last):
...
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: NOT FOUND

Then sys.last_value contains the exception value that fell through to the interactive prompt, and it can be played with
(use the interactive shell's TAB + `.` auto-expansion, dir(), vars(), ...):

>>> ev = sys.last_value
>>> ev.__class__
<class 'urllib2.HTTPError'>
>>> dir(ev)
['_HTTPError__super_init', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', 'args', 'close', 'code', 'errno', 'filename', 'fileno', 'fp', 'getcode', 'geturl', 'hdrs', 'headers', 'info', 'message', 'msg', 'next', 'read', 'readline', 'readlines', 'reason', 'strerror', 'url']
>>> vars(ev)
{'fp': <addinfourl at 140193880 whose fp = <socket._fileobject object at 0x01062370>>, 'fileno': <bound method _fileobject.fileno of <socket._fileobject object at 0x01062370>>, 'code': 404, 'hdrs': <httplib.HTTPMessage instance at 0x085ADF80>, 'read': <bound method _fileobject.read of <socket._fileobject object at 0x01062370>>, 'readlines': <bound method _fileobject.readlines of <socket._fileobject object at 0x01062370>>, 'next': <bound method _fileobject.next of <socket._fileobject object at 0x01062370>>, 'headers': <httplib.HTTPMessage instance at 0x085ADF80>, '__iter__': <bound method _fileobject.__iter__ of <socket._fileobject object at 0x01062370>>, 'url': 'http://httpbin.org/status/404', 'msg': 'NOT FOUND', 'readline': <bound method _fileobject.readline of <socket._fileobject object at 0x01062370>>}
>>> sys.last_value.code
404

Try handling:

>>> try: f = urllib2.urlopen('http://httpbin.org/status/404')
... except urllib2.HTTPError, ev:
...     print ev, "'s error code is", ev.code
...     
HTTP Error 404: NOT FOUND 's error code is 404

Building a simple opener which doesn't throw HTTP errors:

>>> ho = urllib2.OpenerDirector()
>>> ho.add_handler(urllib2.HTTPHandler())
>>> f = ho.open('http://localhost:8080/cgi/somescript.py'); f
<addinfourl at 138851272 whose fp = <socket._fileobject object at 0x01062370>>
>>> f.code
500
>>> f.read()
'Execution error: <pre style="background-color:#faa">\nNameError: name \'e\' is not defined\n<pre>\n'

The default handlers of urllib2.build_opener:

default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                   HTTPDefaultErrorHandler, HTTPRedirectHandler,
                   FTPHandler, FileHandler, HTTPErrorProcessor]
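With an opener like the one above, which hands back the response object instead of raising, error handling turns into a plain dispatch on f.code. A sketch of that pattern (no network involved; the function and its policy are made up for illustration):

```python
def handle_response(code, body):
    # Hypothetical dispatch on the status code of a non-raising opener,
    # mirroring a check of f.code on the object the custom opener returns.
    if code == 404:
        return None            # page missing: signal end of the series
    if code >= 400:
        return None            # other errors: nothing usable
    return body

page = handle_response(200, "page body")    # -> "page body"
missing = handle_response(404, "")          # -> None
```

Whether to branch on status codes or catch exceptions is a style choice; the exception-based approach in the earlier answers is what urllib2's default handler stack gives you out of the box.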
