Python: get HTTP headers from a urllib2.urlopen call?



Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to just read the HTTP response header without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?

import urllib2
myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers

html = page.readlines()  # stream page


安人多梦 2024-07-26 04:36:52


Use the response.info() method to get the headers.

From the urllib2 docs:

urllib2.urlopen(url[, data][, timeout])

...

This function returns a file-like object with two additional methods:

  • geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
  • info() — return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

So, for your example, try stepping through the result of response.info().headers for what you're looking for.

Note that the major caveat to using httplib.HTTPMessage is documented in Python issue 4773.
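For example, here is a minimal sketch (assuming Python 2 and the URL from the question) that reads only the headers from the object urlopen returns, without reading the body:

import urllib2

myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'

response = urllib2.urlopen(myurl)   # note: this still issues a GET

print response.geturl()             # final URL after any redirects
print response.getcode()            # HTTP status code

headers = response.info()           # httplib.HTTPMessage instance
print headers.getheader('Content-Type')

# Iterate over every header name/value pair.
for name, value in headers.items():
    print '%s: %s' % (name, value)

response.close()                    # close without reading the body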

哆兒滾 2024-07-26 04:36:52


What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.

>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
你的他你的她 2024-07-26 04:36:52


Actually, it appears that urllib2 can do an HTTP HEAD request.

The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.

Here's my take on it:

import urllib2

# Derive from Request class and override get_method to allow a HEAD request.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

myurl = 'http://bit.ly/doFeT'
request = HeadRequest(myurl)

try:
    response = urllib2.urlopen(request)
    response_headers = response.info()

    # Print all of the header name/value pairs.  Replace this
    # line with something useful.
    print(response_headers.dict)

except urllib2.HTTPError, e:
    # Prints the HTTP Status code of the response but only if there was a 
    # problem.
    print ("Error code: %s" % e.code)

If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending out a HEAD request, rather than a GET.

This is the HTTP request and response from the code above, as captured by Wireshark:

HEAD /doFeT HTTP/1.1
Accept-Encoding: identity
Host: bit.ly
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 301 Moved
Server: nginx
Date: Sun, 19 Feb 2012 13:20:56 GMT
Content-Type: text/html; charset=utf-8
Cache-control: private; max-age=90
Location: http://www.kidsidebyside.org/?p=445
MIME-Version: 1.0
Content-Length: 127
Connection: close
Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly

However, as mentioned in one of the comments in the other question, if the URL in question includes a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming, if you really wanted to only make HEAD requests.

The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:

GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Accept-Encoding: identity
Host: www.kidsidebyside.org
Connection: close
User-Agent: Python-urllib/2.7
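If keeping the HEAD method across redirects matters, one possible workaround (a sketch of my own, not something the original answer provides; the handler subclass below is hypothetical) is to override urllib2.HTTPRedirectHandler.redirect_request so the redirected request is also a HeadRequest:

import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

class HeadRedirectHandler(urllib2.HTTPRedirectHandler):
    # Hypothetical sketch: rebuild the redirected request as a HEAD request
    # instead of the plain GET Request the default handler constructs.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)
        if new_req is not None:
            new_req = HeadRequest(new_req.get_full_url(),
                                  headers=new_req.headers)
        return new_req

opener = urllib2.build_opener(HeadRedirectHandler())
response = opener.open(HeadRequest('http://bit.ly/doFeT'))
print response.geturl()   # destination URL after the redirect
print response.info()     # headers of the final HEAD response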

An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:

import httplib2

url = "http://bit.ly/doFeT"
http_interface = httplib2.Http()

try:
    response, content = http_interface.request(url, method="HEAD")
    print ("Response status: %d - %s" % (response.status, response.reason))

    # Print all of the response attributes.  Replace this
    # line with something useful.
    print(response.__dict__)

except httplib2.ServerNotFoundError, e:
    print (e.message)

This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

Here's the first request:

HEAD /doFeT HTTP/1.1
Host: bit.ly
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

Here's the second request, to the destination:

HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Host: www.kidsidebyside.org
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

往日情怀 2024-07-26 04:36:52


urllib2.urlopen does an HTTP GET (or POST if you supply a data argument), not an HTTP HEAD (if it did the latter, you couldn't do readlines or other accesses to the page body, of course).
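As a minimal illustration (the endpoint URL below is hypothetical), the presence of the data argument is what switches urlopen from GET to POST:

import urllib
import urllib2

url = 'http://www.example.com/search'   # hypothetical endpoint

urllib2.urlopen(url)                    # no data argument -> HTTP GET

data = urllib.urlencode({'q': 'unity'})
urllib2.urlopen(url, data)              # data argument -> HTTP POST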

樱花坊 2024-07-26 04:36:52


One-liner:

$ python -c "import urllib2; print urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1)).open(urllib2.Request('http://google.com'))"
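Setting debuglevel=1 makes httplib print the outgoing request line and each response header line to stdout as the page is fetched, so the headers show up in the console output even though this one-liner still performs a full GET.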
世俗缘 2024-07-26 04:36:52
def _GetHtmlPage(self, addr):
    # Build the request with custom headers and fetch the page.
    headers = {'User-Agent': self.userAgent,
               'Cookie': self.cookies}

    req = urllib2.Request(addr, headers=headers)
    response = urllib2.urlopen(req)

    # response.info() holds the HTTP response headers.
    print "ResponseInfo="
    print response.info()

    resultsHtml = unicode(response.read(), self.encoding)
    return resultsHtml