How do compression codecs work in Python?
I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
4 Answers
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams. The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
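A rough sketch of what such a class can look like, built on the incremental bz2.BZ2Compressor (the class name is just illustrative; archive_file and cursor are the variables from the question):

import bz2

class BZ2StreamWriter(object):
    """File-like wrapper that feeds every write through a single
    incremental compressor, instead of turning each write into its
    own complete bz2 stream the way the codecs writer does."""

    def __init__(self, filename, mode='ab'):
        # Append mode here, to match the caveat below.
        self.output = open(filename, mode)
        self.compressor = bz2.BZ2Compressor()

    def write(self, data):
        # The compressor buffers internally and only emits bytes once
        # it has accumulated enough input.
        self.output.write(self.compressor.compress(data.encode('utf-8')))

    def close(self):
        # Flush whatever the compressor is still holding and finish
        # the stream.
        self.output.write(self.compressor.flush())
        self.output.close()

log_file = BZ2StreamWriter(archive_file)
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
log_file.close()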
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block. I would try building a much larger string (or list of strings if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more) as that is the block size that bzip2 uses.
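One way to apply that suggestion, assuming Python 2 like the original snippet and keeping archive_file and cursor from the question (the 900 KB threshold and the helper names below are just illustrative; each write still becomes its own compressed stream, but large writes compress far better than one per row):

import codecs

BUFFER_TARGET = 900 * 1024  # roughly one bzip2 block

log_file = codecs.open(archive_file, 'w', 'bz2')
chunks = []
buffered = 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    chunks.append(line)
    buffered += len(line)
    if buffered >= BUFFER_TARGET:
        # One big write means one big, well-compressed block.
        log_file.write(''.join(chunks))
        chunks, buffered = [], 0
if chunks:
    log_file.write(''.join(chunks))
log_file.close()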
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
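Something along these lines, assuming Python 2 (where the bytes-oriented 'zip' codec still works with codecs.open), a throwaway file name, and a short string such as 'test':

import codecs

f = codecs.open('testfile', 'w', 'zip')
f.write('test')
f.close()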
On my system, this produces a file 12 bytes in size. Let's see what it contains:
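Reading the raw bytes back (the exact bytes can vary with the zlib version, but the shape is always a single zlib stream):

print(repr(open('testfile', 'rb').read()))
# something like 'x\x9c+I-.\x01\x00\x04]\x01\xc1'
# 2-byte zlib header, deflate data, 4-byte Adler-32 checksum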
Okay, now let's do another write in append mode:
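Continuing the sketch with the same hypothetical file:

f = codecs.open('testfile', 'a+', 'zip')
f.write('test')
f.close()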
The file is now 24 bytes in size, and its contents are:
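Reading it back again shows two complete, independent zlib streams concatenated back to back:

print(repr(open('testfile', 'rb').read()))
# something like 'x\x9c+I-.\x01\x00\x04]\x01\xc1x\x9c+I-.\x01\x00\x04]\x01\xc1'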
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. However, it's not going to compress very well unless you are writing large amounts of data at a time (maybe more than 1 KB). This is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), then you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.
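A rough sketch of such a buffered wrapper (the class name and the 64 KB threshold are arbitrary choices):

import gzip

class BufferedGzipFile(object):
    """Collect writes in memory and only push them to the underlying
    GzipFile once a threshold is reached, so the compressor sees
    larger chunks at a time."""

    def __init__(self, filename, buffer_size=64 * 1024):
        self.gzfile = gzip.open(filename, 'wb')
        self.buffer = []
        self.buffered = 0
        self.buffer_size = buffer_size

    def write(self, data):
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.buffer_size:
            self.flush()

    def flush(self):
        # Anything still sitting in the buffer is lost if the process
        # dies before this runs (the trade-off mentioned above).
        self.gzfile.write(''.join(self.buffer).encode('utf-8'))
        self.buffer, self.buffered = [], 0

    def close(self):
        self.flush()
        self.gzfile.close()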