Decompress part of a .gz file using Python
So here's the problem. I have a sample.gz file which is roughly 60KB in size. I want to decompress the first 2000 bytes of this file. I am running into a "CRC check failed" error, I guess because the gzip CRC field appears at the end of the file, and it requires the entire gzipped file to decompress. Is there a way to get around this? I don't care about the CRC check. Even if I fail to decompress because of a bad CRC, that is OK. Is there a way to get around this and unzip partial .gz files?
The code I have so far is:
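# Python 2 (the traceback below comes from Python 2.4)
import gzip
import StringIO

# Read only the first 2000 compressed bytes of the file.
f_in = open('sample.gz', 'rb')
mybuf = StringIO.StringIO(f_in.read(2000))
f_in.close()

# Decompress from the in-memory buffer; read() is where the CRC error is raised.
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
print data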
The error encountered is:
  File "gunzip.py", line 27, in ?
    data = f.read()
  File "/usr/local/lib/python2.4/gzip.py", line 218, in read
    self._read(readsize)
  File "/usr/local/lib/python2.4/gzip.py", line 273, in _read
    self._read_eof()
  File "/usr/local/lib/python2.4/gzip.py", line 309, in _read_eof
    raise IOError, "CRC check failed"
IOError: CRC check failed
Also, is there any way to use the zlib module to do this and ignore the gzip headers?
Comments (4)
The issue with the gzip module is not that it can't decompress the partial file; the error occurs only at the end, when it tries to verify the checksum of the decompressed content. (The original checksum is stored at the end of the compressed file, so the verification can never work with a partial file.)
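To make the layout concrete, here is a small sketch in Python 3. The payload is made up for illustration and stands in for the real file contents. It shows that the last eight bytes of a gzip file hold the CRC-32 and the uncompressed size, which is why a 2000-byte prefix has nothing to verify against:

```python
import gzip
import struct
import zlib

# Made-up sample payload, standing in for the real file contents.
payload = b"".join(b"record %d\n" % i for i in range(5000))
compressed = gzip.compress(payload)

# The gzip trailer is the final 8 bytes: CRC-32 of the uncompressed data
# followed by the uncompressed size modulo 2**32, both little-endian.
stored_crc, stored_size = struct.unpack("<II", compressed[-8:])

crc_matches = stored_crc == (zlib.crc32(payload) & 0xFFFFFFFF)
size_matches = stored_size == len(payload) % 2**32

# A 2000-byte prefix stops long before the trailer is reached.
prefix = compressed[:2000]
prefix_is_truncated = len(prefix) < len(compressed)
```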
The key is to trick gzip into skipping the verification. The answer by caesar0301 does this by modifying the gzip source code, but it's not necessary to go that far; simple monkey patching will do. I wrote a context manager that temporarily replaces gzip.GzipFile._read_eof while I decompress the partial file.
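A minimal sketch of what such a context manager could look like is given below. It is an illustration, not the answer's original code, and the name `patch_gzip_for_partial` is hypothetical. It assumes an interpreter where the check lives in `gzip.GzipFile._read_eof`, as in the Python 2.4 traceback in the question; recent Python 3 releases structure the gzip module differently, so there the patch installs and removes cleanly but should not be expected to help.

```python
import contextlib
import gzip

_MISSING = object()  # marks "GzipFile had no _read_eof attribute of its own"


@contextlib.contextmanager
def patch_gzip_for_partial():
    """Temporarily replace gzip.GzipFile._read_eof with a no-op.

    Assumption: the interpreter performs its CRC check in
    GzipFile._read_eof (true for the Python 2.4 gzip.py in the question).
    """
    original = vars(gzip.GzipFile).get("_read_eof", _MISSING)
    gzip.GzipFile._read_eof = lambda self: None
    try:
        yield
    finally:
        # Restore the class exactly as it was, even if reading failed.
        if original is _MISSING:
            del gzip.GzipFile._read_eof
        else:
            gzip.GzipFile._read_eof = original


# Usage on such an interpreter (buf holds the first 2000 compressed bytes):
# with patch_gzip_for_partial():
#     data = gzip.GzipFile(fileobj=buf).read()
```

The `try`/`finally` keeps the patch from leaking into unrelated code if decompression raises for some other reason.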
It seems that you need to look into Python's zlib library instead.
The GZIP format relies on zlib, but introduces a file-level compression concept along with CRC checking, and this appears to be what you do not want/need at the moment.
See for example these code snippets from Doug Hellmann.
Edit: the code on Doug Hellmann's site only shows how to compress or decompress with zlib. As indicated above, GZIP is "zlib with an envelope", and you'll need to decode the envelope before getting to the zlib-compressed data per se. Here's more info on how to go about it; it's really not that complicated:
Sorry to provide neither a simple procedure nor a ready-to-go snippet; however, decoding the file with the indications above should be relatively quick and simple.
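The envelope decoding described above can be sketched roughly as follows. This is a minimal Python 3 sketch, not code from the answer: the function names `gzip_header_length` and `inflate_prefix` are hypothetical, the header layout is the one defined in RFC 1952, and the sketch assumes the whole header falls inside the bytes passed in.

```python
import struct
import zlib

# Flag bits of the FLG byte, as defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 2, 4, 8, 16


def gzip_header_length(data):
    """Return how many bytes the gzip header occupies at the start of data."""
    magic, method, flags = struct.unpack_from("<HBB", data, 0)
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a gzip stream using deflate")
    pos = 10  # fixed part: magic, method, flags, mtime, xfl, os
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = data.index(b"\x00", pos) + 1  # zero-terminated file name
    if flags & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2  # CRC-16 of the header
    return pos


def inflate_prefix(data):
    """Decompress as much as possible from a leading slice of a gzip file."""
    # Negative wbits selects a raw deflate stream: no header, no checksum.
    raw = zlib.decompressobj(-zlib.MAX_WBITS)
    return raw.decompress(data[gzip_header_length(data):])


# Usage: inflate_prefix(open('sample.gz', 'rb').read(2000))
```

Because the raw deflate stream carries no checksum, the missing trailer is never consulted, and the decompressor simply returns whatever it could decode from the bytes it was given.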
I can't see any possible reason why you would want to decompress the first 2000 compressed bytes. Depending on the data, this may uncompress to any number of output bytes.
Surely you want to uncompress the file and stop once you have uncompressed as much of it as you need, for example by reading only the first 4000 uncompressed bytes.
AFAIK, this won't cause the whole file to be read. It will only read as much as is necessary to get the first 4000 bytes.
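The approach described above can be sketched as follows. This is a minimal Python 3 sketch under the assumption that the complete .gz file is available on disk; the helper name `read_head` and its `limit` parameter are hypothetical.

```python
import gzip


def read_head(path, limit=4000):
    """Return at most `limit` uncompressed bytes from the start of a gzip file."""
    with gzip.open(path, "rb") as f:
        # read(limit) stops once enough output has been produced.
        return f.read(limit)


# Usage: data = read_head('sample.gz', 4000)
```

Note that the limit here counts uncompressed output bytes, not the compressed input bytes the question asked about.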
I also ran into this problem when using a Python script to read compressed files that were generated by the gzip tool under Linux and whose original files had been lost.
Reading the implementation of Python's gzip.py, I found that gzip.GzipFile offers methods similar to those of the File class and uses Python's zlib module to compress and decompress the data. It also has a _read_eof() method that checks the CRC of each file.
But in some situations, such as processing a stream or a .gz file without a correct CRC (my problem), _read_eof() raises IOError("CRC check failed"). I therefore modified the gzip module to disable the CRC check, and the problem went away.
https://github.com/caesar0301/PcapEx/blob/master/live-scripts/gzip_mod.py
I know it's a brute-force solution, but it saves a lot of time compared with rewriting low-level methods yourself on top of the zlib module, such as reading data chunk by chunk from the compressed file and extracting it line by line, most of which is already present in the gzip module.
Jamin
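For comparison, the chunk-by-chunk, line-by-line reading mentioned above does not take much code with zlib alone. The sketch below is a different technique from the modified gzip module linked in this answer, written for Python 3; the function name `iter_lines_from_partial_gzip` and its parameters are hypothetical. It handles a truncated stream, since the trailer is never reached, but a complete stream with a wrong CRC would still raise zlib.error, because zlib verifies the trailer itself in this mode.

```python
import zlib


def iter_lines_from_partial_gzip(fileobj, max_compressed=None, chunk_size=1024):
    """Yield lines decoded from a gzip stream that may be cut short.

    max_compressed optionally limits how many compressed bytes are read.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    remaining = max_compressed
    while not decomp.eof:
        size = chunk_size if remaining is None else min(chunk_size, remaining)
        chunk = fileobj.read(size) if size else b""
        if not chunk:
            break  # input exhausted or byte budget used up
        if remaining is not None:
            remaining -= len(chunk)
        pending += decomp.decompress(chunk)
        # Everything before the last newline is a complete line.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line
    if pending:
        yield pending  # final fragment; may be cut off mid-line


# Usage:
# with open('sample.gz', 'rb') as f:
#     for line in iter_lines_from_partial_gzip(f, max_compressed=2000):
#         print(line)
```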