How can I tell whether a file is gzip compressed?
I have a Python program which is going to take text files as input. However, some of these files may be gzip compressed.
Is there a cross-platform way, usable from Python, to determine whether a file is gzip compressed?
Is the following reliable, or could an ordinary text file "accidentally" look gzip-like enough for me to get false positives?
    try:
        gzip.GzipFile(filename, 'r')
        # compressed
        # ...
    except:
        # not compressed
        # ...
Answers (6)
The accepted answer explains how to detect a gzip compressed file in general: test whether the first two bytes are 1f 8b. However, it does not show how to implement that in Python. One way is to open the file in binary mode, read its first two bytes, and compare them with b'\x1f\x8b'.
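A minimal sketch of that two-byte check. The function name is_gz_file is illustrative, and the check only inspects the header, so it says nothing about whether the rest of the file is intact:

```python
def is_gz_file(filepath):
    """Return True if the file starts with the gzip magic number 1f 8b."""
    # Binary mode matters: text mode would try to decode the bytes.
    with open(filepath, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```

A file shorter than two bytes simply yields a shorter read, so the comparison is False rather than an error.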
The magic number for gzip compressed files is 1f 8b. Although testing for this is not 100% reliable, it is highly unlikely that "ordinary text files" start with those two bytes; in UTF-8 that sequence is not even legal.
Usually gzip compressed files carry the suffix .gz, though. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably rely on that, but you'd still have to deal with a possible IOError (which you have to in any case).
One problem with your approach is that gzip.GzipFile() will not throw an exception if you feed it an uncompressed file. Only a later read() will. This means that you would probably have to implement some of your program logic twice. Ugly.
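To illustrate the last point, here is a small sketch that reports which step actually rejects a file. The helper name first_failure_point is hypothetical and not part of the gzip module:

```python
import gzip

def first_failure_point(path):
    """Return "open", "read", or "none" depending on which step raises OSError."""
    try:
        f = gzip.GzipFile(path, "r")  # opens the file; the header is not checked yet
    except OSError:
        return "open"                 # e.g. the file does not exist
    try:
        f.read(1)                     # the gzip header is parsed on the first read
    except OSError:
        return "read"
    finally:
        f.close()
    return "none"
```

For an existing plain-text file this returns "read", which is why the try/except in the question never reaches its "not compressed" branch at construction time.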
Testing the magic number of a gzip file is the only reliable way to go. However, as of Python 3.7 there is no need to compare the bytes yourself anymore: the gzip module compares them for you and raises an exception on the first read if they do not match.
As of Python 3.7, catching OSError around that read works.
As of Python 3.8, catching the more specific gzip.BadGzipFile also works.
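A hedged sketch of that approach. The helper name is_gzip_by_read and the getattr fallback are assumptions added here for illustration; gzip.BadGzipFile exists from Python 3.8 and is a subclass of OSError:

```python
import gzip

# Use the specific exception where available (Python 3.8+), else plain OSError.
_NOT_GZIP = getattr(gzip, "BadGzipFile", OSError)

def is_gzip_by_read(path):
    """Let the gzip module check the header by attempting a one-byte read."""
    with gzip.open(path, "rb") as fh:
        try:
            fh.read(1)
        except _NOT_GZIP:
            return False
    return True
```

Errors from opening the file itself (missing file, permissions) are deliberately not caught, so they are not mistaken for "not gzip".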
gzip itself will raise an OSError if the file is not gzipped.
You can combine this approach with others to increase confidence, such as checking the mimetype, looking for the magic number in the file header (see the other answers for an example), and checking the extension.
Import the mimetypes module.
It can automatically guess what kind of file you have, and whether it is compressed.
For example, calling mimetypes.guess_type on a file name ending in .txt.gz
returns:
('text/plain', 'gzip')
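A short sketch of that call. The wrapper name looks_gzipped_by_name and the file names are made up for illustration; note that guess_type works from the name alone and never opens the file:

```python
import mimetypes

def looks_gzipped_by_name(filename):
    """True if the file *name* maps to the gzip encoding; contents are not read."""
    _type, encoding = mimetypes.guess_type(filename)
    return encoding == "gzip"

print(mimetypes.guess_type("blabla.txt.gz"))  # ('text/plain', 'gzip')
print(looks_gzipped_by_name("blabla.txt"))    # False
```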
Doesn't seem to work well in Python 3. For the file datasets/test, mimetypes.guess_type
returns (None, None)
But the Unix command "file" recognizes it:
    :~> file datasets/test
    datasets/test: gzip compressed data, was "iostat_collection", from Unix, last modified: Thu Jan 29 07:09:34 2015
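The difference is that mimetypes.guess_type only looks at the file name, while file(1) inspects the contents. A sketch contrasting the two on gzip data stored without an extension; the helper name name_vs_content and the temporary file are assumptions for the demo:

```python
import gzip
import mimetypes
import os
import tempfile

def name_vs_content(path):
    """Return (encoding guessed from the name, whether the bytes start with 1f 8b)."""
    _type, encoding = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        has_magic = f.read(2) == b"\x1f\x8b"
    return encoding, has_magic

# Hypothetical demo: gzip data saved under a name with no extension.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test")
    with gzip.open(path, "wb") as f:
        f.write(b"iostat output\n")
    print(name_vs_content(path))  # (None, True)
```

The name gives no hint, but the first two bytes still identify the file as gzip.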