urllib2: try and except on 404
I'm trying to go through a series of numbered data pages using urllib2. What I want to do is use a try statement, but I have little knowledge of it. From reading up a bit, it seems to be based on specific named exceptions, e.g. IOError etc. I don't know which error code I'm looking for, which is part of the problem.
I've written (mostly pasted from 'urllib2: The Missing Manual') my urllib2 page-fetching routine thus:
import os
import sys
import urllib2
import cookielib

COOKIEFILE = 'cookies.lwp'   # cookie save file, defined elsewhere in my script

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    txheaders = {'User-agent': useragent}
    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
        print "previous cookie loaded..."
    else:
        print "no ospath to cookfile"
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    try:
        # create a request object; the user-agent belongs in the headers,
        # not as the second positional argument (that would be POST data)
        req = Request(url, headers=txheaders)
        # and open it to return a handle on the url
        handle = urlopen(req)
    except IOError, e:
        print 'Failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        return False
    else:
        print
        if cj is None:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, ' : ', cookie
            cj.save(COOKIEFILE)  # save the cookies again
    page = handle.read()
    return page
def fetch_series():
    useragent = "Firefox...etc."
    url = "http://www.example.com/01.html"   # urlopen needs the scheme
    try:
        fetch_page(url, useragent)
    except [something]:
        print "failed to get page"
        sys.exit()
The bottom function is just an example to show what I mean. Can anyone tell me what I should be putting there? I made the page-fetching function return False when it gets a 404; is this correct? So why doesn't except False: work? Thanks for any help you can give.
OK, as per the advice here I've tried:
except urlib2.URLError, e:
except URLError, e:
except URLError:
except urllib2.IOError, e:
except IOError, e:
except IOError:
except urllib2.HTTPError, e:
except urllib2.HTTPError:
except HTTPError:
None of them work.
3 Answers
You should catch urllib2.HTTPError if you want to detect a 404 (see http://docs.python.org/library/urllib2.html):
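A minimal sketch of the shape this takes inside fetch_page (the 404 branch is an assumption about what you want to happen; anything other than a 404 is re-raised):

    try:
        req = urllib2.Request(url, headers=txheaders)
        handle = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        if e.code == 404:
            print 'Got a 404 on "%s".' % url
            return False
        raise

To catch it in fetch_series() instead, fetch_page must let the exception propagate (i.e. not swallow it and return False), otherwise the except clause below never fires:

    def fetch_series():
        useragent = "Firefox...etc."
        url = "http://www.example.com/01.html"
        try:
            fetch_page(url, useragent)
        except urllib2.HTTPError, e:
            print "failed to get page (HTTP %s)" % e.code
            sys.exit()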
I recommend you check out the wonderful requests module. With it you could achieve the functionality you are asking about like so:
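A sketch under the question's assumptions (full URL with scheme, a plain GET, no cookie persistence; if you need cookies across requests, a requests.Session keeps them for you):

    import sys
    import requests

    def fetch_page(url, useragent):
        response = requests.get(url, headers={'User-agent': useragent})
        if response.status_code == 404:
            return False
        return response.text

    def fetch_series():
        useragent = "Firefox...etc."
        url = "http://www.example.com/01.html"
        page = fetch_page(url, useragent)
        if page is False:
            print "failed to get page"
            sys.exit()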
Interactive poking:
To find out about the nature and possible content of such exceptions in Python, it's best to simply try the key calls interactively:
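For example (the URL is an assumption; any address that actually returns a 404 will do):

    >>> import urllib2
    >>> urllib2.urlopen('http://www.example.com/no-such-page.html')
    Traceback (most recent call last):
      ...
    urllib2.HTTPError: HTTP Error 404: Not Found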
Then sys.last_value contains the exception value that fell through to the interactive prompt, and it can be played with (use the shell's TAB auto-expansion after a dot, dir(), vars(), ...):
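For instance (attribute names as on the Python 2 urllib2.HTTPError object):

    >>> import sys
    >>> e = sys.last_value
    >>> e.code
    404
    >>> e.msg
    'Not Found'
    >>> dir(e)   # poke around for anything else of interest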
Try handling:
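Something along these lines, using the two exception types urllib2 actually raises (HTTPError is a subclass of URLError, so it has to be listed first):

    try:
        handle = urllib2.urlopen('http://www.example.com/01.html')
    except urllib2.HTTPError, e:   # 4xx / 5xx responses land here
        print 'HTTP error:', e.code
    except urllib2.URLError, e:    # DNS failure, refused connection, ...
        print 'failed to reach the server:', e.reason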
Building a simple opener which doesn't throw HTTP errors:
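One way to get that behaviour (an assumption about the approach; build_opener drops a default handler when you pass a subclass of it) is to override HTTPErrorProcessor so non-2xx responses are passed through instead of raised:

    class PassThroughHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
        def http_response(self, request, response):
            return response          # never raise, whatever the status code
        https_response = http_response

    opener = urllib2.build_opener(PassThroughHTTPErrorProcessor)
    handle = opener.open('http://www.example.com/01.html')
    print handle.getcode()           # 200, 404, ... check it yourself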
The default handlers of urllib2.build_opener:
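Per the urllib2 documentation, build_opener installs these unless you override them: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor (plus HTTPSHandler if SSL support is available). It is HTTPErrorProcessor that routes non-2xx responses into the error path, and HTTPDefaultErrorHandler that ultimately raises the HTTPError.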