How do I decode GZIP encoding in Python?

Posted on 2024-08-30 03:17:28

I downloaded a webpage in my Python script. In most cases, this works fine.

However, this one had a response header of GZIP encoding, and when I tried to print the source code of this web page, it came out as garbage symbols in my PuTTY terminal.

How do I decode this to regular text?

Comments (9)

谎言 2024-09-06 03:17:28

I use zlib to decompress gzipped content from the web.

import urllib.request
import zlib

f = urllib.request.urlopen(url)
# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
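
A slightly fuller sketch along the same lines (my addition, not from the original answer): it asks the server for gzip explicitly and checks the Content-Encoding header before decompressing. The URL is a placeholder and the page is assumed to be UTF-8 text.

import urllib.request
import zlib

# placeholder URL; use whatever page you are actually fetching
req = urllib.request.Request('https://example.com/',
                             headers={'Accept-Encoding': 'gzip'})
response = urllib.request.urlopen(req)
body = response.read()

# only decompress when the server says the body is gzip-compressed
if response.headers.get('Content-Encoding') == 'gzip':
    body = zlib.decompress(body, 16 + zlib.MAX_WBITS)

print(body.decode('utf-8'))  # assumes the page itself is UTF-8
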
半衬遮猫 2024-09-06 03:17:28

Decompress your byte stream using the built-in gzip module.

If you have any problems, do show the exact minimal code that you used, the exact error message and traceback, together with the result of print(repr(your_byte_stream[:100])).
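
For reference when inspecting that repr output (a rough sketch I have added, not part of the original answer): a gzip stream always starts with the magic bytes 0x1f 0x8b, while zlib-wrapped deflate usually starts with 0x78.

raw = your_byte_stream            # the bytes you downloaded
if raw[:2] == b'\x1f\x8b':
    print('looks like gzip (RFC 1952)')
elif raw[:1] == b'\x78':
    print('probably zlib-wrapped deflate (RFC 1950)')
else:
    print(repr(raw[:100]))        # plain text, or something else entirely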

Further information

1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.

2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:

zlib.decompress(string[, wbits[, bufsize]])

...The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format.

Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:

arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").

arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented).

arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.

The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
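
A small round-trip sketch of those three wbits cases (my addition, not part of the original answer):

import gzip
import zlib

payload = b'hello world ' * 100

# zlib format (RFC 1950): wbits = 15
assert zlib.decompress(zlib.compress(payload), 15) == payload

# raw deflate (RFC 1951): wbits = -15 (level 9, DEFLATED method, negative window size)
co = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_deflate = co.compress(payload) + co.flush()
assert zlib.decompress(raw_deflate, -15) == payload

# gzip format (RFC 1952): wbits = 16 + 15 = 31
assert zlib.decompress(gzip.compress(payload), 31) == payload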

农村范ル 2024-09-06 03:17:28

For Python 3

Try this:

import gzip
import urllib.request

# `opener` and `request` are assumed to come from
# urllib.request.build_opener() and urllib.request.Request(url)
fetch = opener.open(request)          # basically get a response object
data = gzip.decompress(fetch.read())  # undo the gzip compression
data = str(data, 'utf-8')             # decode the bytes to text

你的笑 2024-09-06 03:17:28

I use something like this:

import urllib2
from cStringIO import StringIO
from gzip import GzipFile

# Python 2 helper that returns the page body, decompressed if it was gzipped
def fetch_page(request):
    f = urllib2.urlopen(request)
    data = f.read()
    try:
        data = GzipFile('', 'r', 0, StringIO(data)).read()
    except IOError:
        # body was not gzipped (or was corrupt): keep the raw bytes
        pass
    return data

夜深人未静 2024-09-06 03:17:28

If you use the Requests module, then you don't need to use any other modules because the gzip and deflate transfer-encodings are automatically decoded for you.

Example:

>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...

The .text attribute of the response gives the content decoded as text (a str).

The .content attribute of the response gives the content as raw bytes.

See the Binary Response Content section on docs.python-requests.org
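
A compact sketch of that difference (my addition, reusing the same GitHub events URL as above):

import requests

response = requests.get('https://api.github.com/events')
body_bytes = response.content  # bytes; requests has already undone the gzip transfer-encoding
body_text = response.text      # str; decoded using the charset requests detected
print(body_text[:100])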

优雅的叶子 2024-09-06 03:17:28

None of these answers worked out of the box using Python 3. Here is what worked for me to fetch a page and decode the gzipped response:

import requests
import gzip

response = requests.get('your-url-here')
data = str(gzip.decompress(response.content), 'utf-8')
print(data)  # decoded contents of page

守不住的情 2024-09-06 03:17:28

Similar to Shatu's Python 3 answer above, but arranged a little differently:

import gzip
from urllib.request import Request, urlopen
from json import loads as json_load   # assuming json_load is an alias for json.loads

headers = {'Accept-Encoding': 'gzip'}  # example headers; the original leaves them unspecified

s = Request("https://someplace.com", None, headers)
r = urlopen(s, None, 180).read()       # 180-second timeout
try:
    r = gzip.decompress(r)
except OSError:
    # body was not gzip-compressed; use it as-is
    pass
result = json_load(r.decode())

Wrapping gzip.decompress() in a try/except lets you catch and ignore the OSError raised for an uncompressed body, which helps when you get a mix of compressed and uncompressed responses. Some small payloads actually get bigger when compressed, so the plain data is sent instead.
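
An explicit alternative to the try/except (my addition): a gzip body always begins with the magic bytes 0x1f 0x8b, so you can sniff it before decompressing. The helper name below is hypothetical.

import gzip

def maybe_decompress(body):
    # hypothetical helper: decompress only when the body looks like gzip data
    if body[:2] == b'\x1f\x8b':
        return gzip.decompress(body)
    return body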

旧时光的容颜 2024-09-06 03:17:28

This version is simple and avoids reading the whole file up front by never calling the read() method. Instead, it provides a file-like stream object that behaves just like a normal file stream.

import gzip
from urllib.request import urlopen

my_gzip_url = 'http://my_url.gz'
my_gzip_stream = urlopen(my_gzip_url)
my_stream = gzip.open(my_gzip_stream, 'r')
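
One possible way to consume that stream (my addition, assuming the decompressed payload is UTF-8 text): wrap it so you can iterate over decoded lines without loading the whole file.

import io

# read and print the decompressed content line by line
for line in io.TextIOWrapper(my_stream, encoding='utf-8'):
    print(line.rstrip())
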
玩世 2024-09-06 03:17:28

You can use urllib3 to easily decode gzip.

urllib3.response.decode_gzip(response.data)