Python: Getting HTTP headers from a urllib2.urlopen call?
Does urllib2 fetch the whole page when a urlopen call is made?
I'd like to read just the HTTP response headers without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?
import urllib2
myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers
html = page.readlines()  # stream page
Comments (6)
Use the response.info() method to get the headers. According to the urllib2 docs, info() returns the meta-information of the page, such as headers.
So, for your example, try stepping through the result of response.info().headers for what you're looking for. Note that the major caveat to using httplib.HTTPMessage is documented in Python issue 4773.
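As a rough illustration of the info() approach, here is a minimal sketch. It assumes Python 3, where urllib2 lives on as urllib.request and the message object exposes items() rather than the .headers list mentioned above. The local server, DemoHandler, and fetch_headers are hypothetical names added only so the example runs without network access.

```python
import threading
import urllib.request  # Python 3 successor to urllib2 (assumption: Python 3)
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html>hello</html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local page, so the sketch runs without network access."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def fetch_headers(url):
    """Return the response headers as a dict without calling read()."""
    response = urllib.request.urlopen(url)
    try:
        return dict(response.info().items())
    finally:
        response.close()


# Ignore any system proxy settings so the request reaches the local server.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    headers = fetch_headers("http://127.0.0.1:%d/" % server.server_port)
finally:
    server.shutdown()
    server.server_close()

for name, value in sorted(headers.items()):
    print("%s: %s" % (name, value))
```

Note that this still issues a GET; it only avoids reading the body on the client side.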
What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.
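The snippet referred to is not reproduced here. As a sketch of the general technique (not necessarily the code the answer copied), a Request subclass can override get_method() so that the request goes out as HEAD. This assumes Python 3's urllib.request in place of urllib2; HeadRequest, DemoHandler, and the local server are hypothetical names used for illustration.

```python
import threading
import urllib.request  # Python 3 successor to urllib2 (assumption: Python 3)
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html>hello</html>"
seen_methods = []  # HTTP methods the demo server actually received


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local page that records the method of each request."""

    def _send_headers(self):
        seen_methods.append(self.command)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()  # headers only, no body

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


class HeadRequest(urllib.request.Request):
    """Request whose HTTP method is HEAD instead of the default GET."""

    def get_method(self):
        return "HEAD"


# Ignore any system proxy settings so the request reaches the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = "http://127.0.0.1:%d/" % server.server_port
    response = opener.open(HeadRequest(url))
    headers = dict(response.info().items())
    body = response.read()  # a HEAD response carries no body
    response.close()
finally:
    server.shutdown()
    server.server_close()

print(seen_methods, headers["Content-Type"], body)
```

On Python 3.3 and later, passing method="HEAD" to Request should achieve the same thing without a subclass.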
Actually, it appears that urllib2 can do an HTTP HEAD request.
The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.
Here's my take on it:
If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending out a HEAD request, rather than a GET.
This is the HTTP request and response from the code above, as captured by Wireshark:
However, as mentioned in one of the comments on the other question, if the URL in question includes a redirect, then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming if you really want to make only HEAD requests.
The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:
An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:
This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.
Here's the first request:
Here's the second request, to the destination:
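For readers who would rather stay in the standard library, the redirect caveat described above can probably be worked around with a custom redirect handler. This is a sketch under assumptions, not something taken from the answer: it uses Python 3's urllib.request, and KeepHeadRedirectHandler, RedirectingHandler, and the /start and /dest paths are hypothetical. Where the stock handler already keeps HEAD across redirects, the override is simply a no-op.

```python
import threading
import urllib.request  # Python 3 successor to urllib2 (assumption: Python 3)
from http.server import BaseHTTPRequestHandler, HTTPServer

requests_seen = []  # (method, path) pairs recorded by the demo server


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /start redirects to /dest."""

    def _respond(self):
        requests_seen.append((self.command, self.path))
        if self.path == "/start":
            self.send_response(302)
            self.send_header("Location", "/dest")
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_HEAD = _respond
    do_GET = _respond

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


class KeepHeadRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Re-issue a redirected HEAD request as HEAD rather than GET."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None and req.get_method() == "HEAD":
            new_req.method = "HEAD"  # Request.get_method() consults this attribute
        return new_req


opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),  # skip system proxies for the local demo
    KeepHeadRedirectHandler(),        # replaces the default redirect handler
)

server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = "http://127.0.0.1:%d/start" % server.server_port
    response = opener.open(urllib.request.Request(url, method="HEAD"))
    final_url = response.geturl()
    response.close()
finally:
    server.shutdown()
    server.server_close()

print(requests_seen)
```

Recording the method on the server side plays the role that the Wireshark captures play in the answer: it shows what actually went over the wire for each hop.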
urllib2.urlopen does an HTTP GET (or a POST if you supply a data argument), not an HTTP HEAD. If it did the latter, you couldn't call readlines or otherwise access the page body, of course.
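To see how a plain GET behaves with respect to the original question, the sketch below uses a hypothetical local server that sends its headers and then holds the body back until the client signals it. It assumes Python 3's urllib.request in place of urllib2; SlowBodyHandler and release_body are illustrative names. It suggests that urlopen returns once the status line and headers have arrived, and that the body is consumed only when read() is called, although the operating system may already have buffered some body bytes by then.

```python
import threading
import urllib.request  # Python 3 successor to urllib2 (assumption: Python 3)
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html>hello</html>"
release_body = threading.Event()  # set by the client when it wants the body


class SlowBodyHandler(BaseHTTPRequestHandler):
    """Hypothetical server that sends headers first and the body later."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()             # headers are flushed to the socket here
        release_body.wait(timeout=10)  # hold the body until the client is ready
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


# Ignore any system proxy settings so the request reaches the local server.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), SlowBodyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = "http://127.0.0.1:%d/" % server.server_port
    response = urllib.request.urlopen(url)  # returns before any body is sent
    headers_before_body = dict(response.info().items())
    body_released_early = release_body.is_set()
    release_body.set()                      # now let the server send the body
    body = response.read()
    response.close()
finally:
    release_body.set()
    server.shutdown()
    server.server_close()

print(headers_before_body["Content-Length"], body_released_early, body)
```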
One-liner:
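The one-liner itself is not reproduced here. One possibility, offered only as a sketch and not as the answer's actual code, is to open the URL through an opener whose HTTPHandler has debuglevel=1, which makes http.client print the request and the response status line and headers. This assumes Python 3's urllib.request; dump_exchange, DemoHandler, and the local server are hypothetical.

```python
import contextlib
import io
import threading
import urllib.request  # Python 3 successor to urllib2 (assumption: Python 3)
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html>hello</html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local page, so the sketch runs without network access."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def dump_exchange(url):
    """A single expression: open the URL with wire-level debugging enabled."""
    urllib.request.build_opener(
        urllib.request.ProxyHandler({}),           # skip system proxies for the demo
        urllib.request.HTTPHandler(debuglevel=1),  # http.client prints send/reply lines
    ).open(url).close()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
captured = io.StringIO()
try:
    # Capture stdout so the debug trace can be inspected as a string.
    with contextlib.redirect_stdout(captured):
        dump_exchange("http://127.0.0.1:%d/" % server.server_port)
finally:
    server.shutdown()
    server.server_close()

debug_output = captured.getvalue()
print(debug_output)
```

Outside a script, the body of dump_exchange could be passed to python -c with a real URL in place of the local one.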