Python: creating a streaming gzip'd file-like?
I'm trying to figure out the best way to compress a stream with Python's zlib.
I've got a file-like input stream (input, below) and an output function which accepts a file-like (output_function, below):
    with open("file") as input:
        output_function(input)
And I'd like to gzip-compress input chunks before sending them to output_function:
    with open("file") as input:
        output_function(gzip_stream(input))
It looks like the gzip module assumes that either the input or the output will be a gzip'd file-on-disk… So I assume that the zlib module is what I want.
However, it doesn't natively offer a simple way to create a stream file-like… And the stream-compression it does support comes by way of manually adding data to a compression buffer, then flushing that buffer.
Of course, I could write a wrapper around zlib.Compress.compress and zlib.Compress.flush (Compress is returned by zlib.compressobj()), but I'd be worried about getting buffer sizes wrong, or something similar.
So, what's the simplest way to create a streaming, gzip-compressing file-like with Python?
Edit: To clarify, the input stream and the compressed output stream are both too large to fit in memory, so something like output_function(StringIO(zlib.compress(input.read()))) doesn't really solve the problem.
Comments (6)
It's quite kludgy (self-referencing, etc.; I just spent a few minutes writing it, nothing really elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.
Basically, GzipWrap is a (very limited) file-like object that produces a gzipped file out of a given iterable (e.g., a file-like object, a list of strings, any generator...).
Of course, it produces binary, so there was no sense in implementing "readline".
You should be able to expand it to cover other cases or to be used as an iterable object itself.
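For illustration, here is a minimal Python 3 sketch of the idea described above. It is an assumption about how such a GzipWrap could look, not a definitive implementation: the attribute names, the bytearray buffer and the empty default filename are all illustrative choices. The "self-referencing" part is that GzipFile is handed the wrapper itself as fileobj, so compressed bytes land back in the wrapper's own buffer.

```python
import gzip


class GzipWrap:
    """Very limited read-only file-like that gzips an iterable of bytes."""

    def __init__(self, chunks, filename=""):
        self._chunks = iter(chunks)
        self._buffer = bytearray()   # compressed bytes not yet handed out
        self._finished = False
        # Self-referencing: GzipFile writes its output back into this
        # object through write() below. The buffer must exist first,
        # because the gzip header is written during construction.
        self._zipper = gzip.GzipFile(filename, mode="wb", fileobj=self)

    def write(self, data):
        # Called by GzipFile, not meant for users of this class.
        self._buffer += data
        return len(data)

    def flush(self):
        pass

    def read(self, size=-1):
        # Feed input until enough compressed bytes are buffered
        # (or until the input is exhausted).
        while not self._finished and (size < 0 or len(self._buffer) < size):
            chunk = next(self._chunks, None)
            if chunk is None:
                self._zipper.close()   # flushes and writes the gzip trailer
                self._finished = True
            else:
                self._zipper.write(chunk)
        if size < 0:
            size = len(self._buffer)
        out = bytes(self._buffer[:size])
        del self._buffer[:size]
        return out
```

With this sketch, something like output_function(GzipWrap(input)) should work as long as input yields bytes (in Python 3 that means opening the file in "rb" mode). Iterating a binary file yields lines, which can be arbitrarily long.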
Here is a cleaner, non-self-referencing version based on Ricardo Cárdenes' very helpful answer.
Advantages:
The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, as well as a filename. The filename you pass in doesn't need to exist, but the gzip header has a filename field which needs to be filled out.
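As a small sketch of the fileobj usage described here (assuming Python 3; the io.BytesIO sink and the name "example.txt" are placeholders for illustration):

```python
import gzip
import io

sink = io.BytesIO()  # stand-in for any writable binary file-like object

# "example.txt" is a hypothetical name; it is only stored in the gzip header
# and does not have to exist on disk.
with gzip.GzipFile(filename="example.txt", mode="wb", fileobj=sink) as gz:
    for chunk in (b"first chunk\n", b"second chunk\n"):
        gz.write(chunk)

compressed = sink.getvalue()
```

Note that this is a push-style interface: compressed bytes are written into sink as data is written to gz. By itself it does not give you a readable file-like object to hand to output_function.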
Update
This answer does not work. Example:
output:
Use the cStringIO (or StringIO) module in conjunction with zlib:
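A minimal sketch of that combination, written for Python 3, where cStringIO no longer exists and io.BytesIO plays the same role (the sample data is made up):

```python
import io
import zlib

original = b"some example data " * 100

# Compress in one shot and store the result in an in-memory file-like object.
stream = io.BytesIO()
stream.write(zlib.compress(original))
stream.seek(0)

# The file-like object can now be read back like any other binary stream.
restored = zlib.decompress(stream.read())
```

Keep in mind that this holds both the input and the compressed output in memory, and that zlib.compress produces a zlib-format stream rather than a gzip one.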
This works (at least in Python 3):
Here it writes to s3fs's file object with gzip compression applied.
The magic is the f parameter, which is GzipFile's fileobj. You have to provide a file name for gzip's header.
An even cleaner and more generalized version made of reusable components:
The functions above are from my gist:
iter_io and io_iter provide transparent conversion between Iterable[AnyStr] and SupportsRead[AnyStr]
igzip does streaming gzip compression
prefetch concurrently pulls from an underlying iterable via a thread, yielding to the consumer as normal, for concurrent read/write