Decompressing part of a .gz file using Python

Posted on 2024-08-10 19:01:01

So here's the problem. I have a sample.gz file which is roughly 60 KB in size, and I want to decompress the first 2000 bytes of it. I am running into a "CRC check failed" error, I guess because the gzip CRC field appears at the end of the file and the gzip module requires the entire gzipped file to decompress. Is there a way to get around this? I don't care about the CRC check. Even if I fail to decompress because of a bad CRC, that is OK. Is there a way to unzip partial .gz files?

The code I have so far is

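import gzip
import StringIO

# Read only the first 2000 compressed bytes and wrap them in a file-like object.
file = open('sample.gz', 'rb')
mybuf = StringIO.StringIO(file.read(2000))
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()  # this call raises "CRC check failed" on the truncated input
print data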

The error encountered is

File "gunzip.py", line 27, in ?
    data = f.read()
File "/usr/local/lib/python2.4/gzip.py", line 218, in read
  self._read(readsize)
File "/usr/local/lib/python2.4/gzip.py", line 273, in _read
  self._read_eof()
File "/usr/local/lib/python2.4/gzip.py", line 309, in _read_eof
  raise IOError, "CRC check failed"
IOError: CRC check failed

Also, is there any way to use the zlib module to do this and ignore the gzip headers?

前事休说 2024-08-17 19:01:01

The issue with the gzip module is not that it can't decompress the partial file. The error occurs only at the end, when it tries to verify the checksum of the decompressed content. (The original checksum is stored at the end of the compressed file, so the verification can never work with a partial file.)

The key is to trick gzip into skipping the verification. The answer by caesar0301 does this by modifying the gzip source code, but it's not necessary to go that far; simple monkey patching will do. I wrote this context manager to temporarily replace gzip.GzipFile._read_eof while I decompress the partial file:

import contextlib
import gzip

@contextlib.contextmanager
def patch_gzip_for_partial():
    """
    Context manager that replaces gzip.GzipFile._read_eof with a no-op.

    This is useful when decompressing partial files, something that won't
    work if GzipFile does its checksum comparison.

    """
    _read_eof = gzip.GzipFile._read_eof
    gzip.GzipFile._read_eof = lambda *args, **kwargs: None
    try:
        yield
    finally:
        # Restore the original method even if decompression raises.
        gzip.GzipFile._read_eof = _read_eof

An example usage:

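from cStringIO import StringIO

# `compressed` holds the partial gzip data as a byte string.
with patch_gzip_for_partial():
    # The buffer must be passed as fileobj; the first positional
    # argument of GzipFile is the filename.
    decompressed = gzip.GzipFile(fileobj=StringIO(compressed)).read()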
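
The patch above targets the Python 2 gzip module; later Python 3 releases restructured that module, so the patched attribute may not exist there. As a hedged alternative (not this answer's method), the sketch below uses zlib directly in Python 3. The function name decompress_partial_gzip and the generated sample data are made up for illustration. It relies on zlib.decompressobj returning whatever output the available input yields, so the trailing CRC is never reached on truncated data.

```python
import gzip
import zlib


def decompress_partial_gzip(chunk):
    """Decompress as much as possible from a truncated gzip byte string."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    # The checksum is only verified once the trailer is reached, which
    # never happens when the input is cut short.
    return decompressor.decompress(chunk)


# Illustrative data: build a gzip stream in memory and keep 2000 bytes of it.
original = b"".join(b"line %d of the sample file\n" % i for i in range(5000))
compressed = gzip.compress(original)
partial = decompress_partial_gzip(compressed[:2000])

print(len(compressed), len(partial))
```

The result is a prefix of the original data; the last few bytes of the truncated input may not be enough to produce further output, so the cut-off point in the decompressed data is not predictable in advance.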

李白 2024-08-17 19:01:01

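It seems to me that you need to look into the Python zlib library instead.

The GZIP format relies on zlib, but introduces a file-level compression concept along with CRC checking, and this appears to be what you do not want or need at the moment.

See, for example, these code snippets from Doug Hellmann.

Edit: The code on Doug Hellmann's site only shows how to compress or decompress with zlib. As indicated above, GZIP is "zlib with an envelope", and you need to decode the envelope before getting to the compressed data itself. Here is more information on how to go about it; it is not really that complicated:

See RFC 1952 for details about the GZIP format.
The format starts with a 10-byte header, followed by optional, uncompressed elements such as the file name or a comment, followed by the compressed data (a raw DEFLATE stream), itself followed by a CRC-32 of the uncompressed data and the uncompressed length. (Adler-32 is the checksum of the zlib container format, not of gzip.)
Using Python's struct module, parsing the header should be relatively simple.
The compressed sequence (or its first few thousand bytes, since that is what you want to do) can then be decompressed with Python's zlib module, as shown in the snippets mentioned above.
A possible problem to handle: the GZip archive contains more than one file, and the second file starts within the block of a few thousand bytes we wish to decompress.

Sorry to provide neither a simple procedure nor a ready-to-go snippet, but decoding the file with the pointers above should be relatively quick and simple.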
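
The steps outlined in this answer can be sketched roughly as follows, in Python 3. This is a minimal illustration rather than a hardened parser: the helper names skip_gzip_header and inflate_partial are made up here, the flag values are taken from RFC 1952, and the sketch assumes a single gzip member whose header fits entirely inside the truncated chunk.

```python
import gzip
import io
import struct
import zlib

# Header flag bits defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def skip_gzip_header(data):
    """Return the offset at which the raw DEFLATE stream starts."""
    magic, method, flags = struct.unpack_from("<HBB", data, 0)
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a gzip stream using DEFLATE")
    pos = 10  # fixed part: magic, method, flags, mtime, extra flags, OS
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = data.index(b"\x00", pos) + 1  # zero-terminated file name
    if flags & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2  # CRC-16 of the header
    return pos


def inflate_partial(data):
    """Inflate whatever the truncated gzip data allows."""
    body = data[skip_gzip_header(data):]
    # A negative wbits value selects a raw DEFLATE stream (no container).
    return zlib.decompressobj(-zlib.MAX_WBITS).decompress(body)


# Illustrative data: a gzip stream that stores a file name in its header.
original = b"".join(b"record %d in the archive\n" % i for i in range(5000))
buf = io.BytesIO()
with gzip.GzipFile(filename="sample.txt", mode="wb", fileobj=buf) as f:
    f.write(original)

truncated = buf.getvalue()[:2000]
header_len = skip_gzip_header(truncated)
partial = inflate_partial(truncated)
print(header_len, len(partial))
```

Because the trailer is never parsed, no CRC check takes place. The multi-member case mentioned in the answer is not handled by this sketch.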

青丝拂面 2024-08-17 19:01:01

I can't see any possible reason why you would want to decompress the first 2000 compressed bytes. Depending on the data, this may uncompress to any number of output bytes.

Surely you want to uncompress the file, and stop when you have uncompressed as much of the file as you need, something like:

f = gzip.GzipFile(fileobj=open('postcode-code.tar.gz', 'rb'))
data = f.read(4000)
print data

AFAIK, this won't cause the whole file to be read. It will only read as much as is necessary to get the first 4000 bytes.
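
To illustrate the point about the first 2000 compressed bytes expanding to an unpredictable amount of output, here is a small Python 3 sketch. The helper name inflate_prefix and both test inputs are invented for the demonstration; it decompresses a 2000-byte prefix of two very different gzip streams with zlib and compares the output sizes.

```python
import gzip
import os
import zlib


def inflate_prefix(compressed, n):
    """Decompress only the first n bytes of a gzip byte string."""
    # MAX_WBITS | 16 makes zlib parse the gzip header itself.
    return zlib.decompressobj(zlib.MAX_WBITS | 16).decompress(compressed[:n])


repetitive = b"A" * 5000000          # compresses extremely well
random_like = os.urandom(100000)     # essentially incompressible

out_repetitive = inflate_prefix(gzip.compress(repetitive), 2000)
out_random = inflate_prefix(gzip.compress(random_like), 2000)

print(len(out_repetitive), len(out_random))
```

The same number of compressed bytes yields vastly different amounts of output, which is why reading a fixed number of decompressed bytes, as in the snippet above, is usually the more meaningful request.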

一笑百媚生 2024-08-17 19:01:01

I also encountered this problem when using my Python script to read compressed files that were generated by the gzip tool under Linux and whose original files had been lost.

Reading the implementation of Python's gzip.py, I found that gzip.GzipFile offers methods similar to those of the file class and uses Python's zlib module to compress and decompress the data. It also has a _read_eof() method that checks the CRC of each file.

But in some situations, such as processing a stream or a .gz file without a correct CRC (my problem), _read_eof() raises IOError("CRC check failed"). Therefore, I modified the gzip module to disable the CRC check, and the problem finally disappeared.

def _read_eof(self):
    pass

https://github.com/caesar0301/PcapEx/blob/master/live-scripts/gzip_mod.py

I know it's a brute-force solution, but it saves a lot of the time you would otherwise spend rewriting low-level methods on top of the zlib module yourself, such as reading data chunk by chunk from the compressed file and extracting it line by line, most of which is already present in the gzip module.

Jamin
