Why do I get garbled characters when opening a URL with urllib2?
Here is my code; you can test it yourself. I always get a mess of garbled characters instead of the page source.
import urllib2

# Send a browser-like User-Agent so the server treats this as a normal request
Header = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"}
Req = urllib2.Request("http://rlslog.net", None, Header)
Response = urllib2.urlopen(Req)
Html = Response.read()
print Html[:1000]
Normally Html should contain the page source, but instead it ends up as a mass of garbled characters. Does anyone know why?
By the way: I'm using Python 2.7.
2 Answers
As Bruce already suggested, it seems to be a problem with compression. The server returns gzip-compressed content, but urllib2 does not support automatic gzip decompression. In fact, as far as I know the server is misbehaving here: it should only compress the content if an Accept-Encoding: gzip header is present (which you either provide yourself, or which your client adds automatically if it supports compression). So: either use a library that supports it automatically, like httplib2 (which I've tested with the page in question, and it works), or decompress it yourself (see the answer to this SO question for how to do it; note that in that question the headers returned by the server are checked to see whether the content is gzip-compressed).
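A minimal sketch of the manual approach, assuming Python 2.7 and the URL from the question (the explicit Accept-Encoding header and the gzip/StringIO handling are my additions, not part of the original answer):

import urllib2
import gzip
from StringIO import StringIO

# Advertise gzip support explicitly, since we are prepared to decompress it
header = {"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"}
req = urllib2.Request("http://rlslog.net", None, header)
response = urllib2.urlopen(req)
raw = response.read()

# Only decompress if the server says the body is gzip-compressed
if response.info().get("Content-Encoding") == "gzip":
    raw = gzip.GzipFile(fileobj=StringIO(raw)).read()

print raw[:1000]

With httplib2 the same fetch is essentially a one-liner, since it negotiates and decompresses gzip for you: h = httplib2.Http(); resp, content = h.request("http://rlslog.net").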
You make your request with a user agent that supports on-the-fly compression. Are you sure the output is not gzip-compressed? Try running it through the zlib module and/or printing the headers.
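A short sketch of that check, assuming the same URL as in the question; the wbits value of 16 + zlib.MAX_WBITS tells zlib to expect a gzip wrapper around the deflate stream:

import urllib2
import zlib

response = urllib2.urlopen("http://rlslog.net")
body = response.read()

# Print the headers to see whether Content-Encoding: gzip was sent
print response.info()

# Decompress only if the server flagged the body as gzip
if response.info().get("Content-Encoding") == "gzip":
    body = zlib.decompress(body, 16 + zlib.MAX_WBITS)

print body[:1000]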