Create a zip file from a generator in Python?
I've got a large amount of data (a couple gigs) I need to write to a zip file in Python. I can't load it all into memory at once to pass to the .writestr method of ZipFile, and I really don't want to feed it all out to disk using temporary files and then read it back.
Is there a way to feed a generator or a file-like object to the ZipFile library? Or is there some reason this capability doesn't seem to be supported?
By zip file, I mean a zip file, as supported by the Python zipfile package.
The only solution is to rewrite the method it uses for zipping files so that it reads from a buffer. It would be trivial to add this to the standard library; I'm kind of amazed it hasn't been done yet. I gather there's broad agreement that the entire interface needs to be overhauled, and that seems to be blocking any incremental improvements.
Changed in Python 3.5 (from the official docs): "Added support for writing to unseekable streams."
This means that for zipfile.ZipFile we can now use streams which do not store the entire file in memory. Such streams do not support seeking over the entire data volume.
So this is a simple generator:
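A minimal sketch of such a generator, assuming Python 3.6+ (for ZipInfo.from_file and ZipFile.open(mode='w')) and the UnseekableStream helper sketched further below:

```python
import zipfile

def zipfile_generator(path, stream):
    # Stream the file at 'path' (assumed to be a regular file) into a zip
    # archive written to the unseekable 'stream', yielding zipped bytes
    # as they become available.
    with zipfile.ZipFile(stream, mode='w') as zf:
        z_info = zipfile.ZipInfo.from_file(path)
        z_info.compress_type = zipfile.ZIP_DEFLATED
        with open(path, 'rb') as entry, zf.open(z_info, mode='w') as dest:
            for chunk in iter(lambda: entry.read(16384), b''):
                dest.write(chunk)
                yield stream.get()   # drain what ZipFile has written so far
    yield stream.get()               # remaining bytes: central directory, etc.
```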
Here path is a string path of the large file (or a path-like object), and stream is the unseekable stream instance of a class like the one sketched below (designed according to the official docs). You can try this code online: https://repl.it/@IvanErgunov/zipfilegenerator
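One possible sketch of such a class, with a short usage example (the file names in the usage snippet are made up):

```python
import io

class UnseekableStream(io.RawIOBase):
    """A write-only byte sink; get() drains whatever ZipFile has written."""

    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('Stream was closed!')
        self._buffer += bytes(b)
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk

# Illustrative usage: each yielded chunk is the next piece of the zip file.
stream = UnseekableStream()
with open('out.zip', 'wb') as f:
    for chunk in zipfile_generator('large_file.bin', stream):
        f.write(chunk)
```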
There is also another way to create a generator, without ZipInfo, by manually reading and dividing your large file. You can pass a queue.Queue() object to your UnseekableStream() object and write to this queue in another thread; then, in the current thread, you can simply read chunks from that queue in an iterable way. See the docs.
P.S. Python Zipstream by allanlei is an outdated and unreliable approach. It was an attempt to add support for unseekable streams before this was done officially.
I took Chris B.'s answer and created a complete solution. Here it is in case anyone else is interested:
gzip.GzipFile writes the data in gzipped chunks, and you can set the size of your chunks according to the number of lines read from the files.
An example:
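A minimal sketch of that idea; the file names and the 10,000-line batch size here are made up:

```python
import gzip

# Batch lines from a large input and write each batch through GzipFile,
# so only one batch is held in memory at a time.
with open('large_input.txt', 'rb') as src, gzip.open('output.gz', 'wb') as gz:
    batch = []
    for line in src:
        batch.append(line)
        if len(batch) >= 10000:
            gz.write(b''.join(batch))
            batch = []
    if batch:
        gz.write(b''.join(batch))
```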
The essential compression is done by zlib.compressobj. ZipFile (under Python 2.5 on MacOSX) appears to be compiled; the Python 2.3 version is as follows. You can see that it builds the compressed file in 8k chunks. Taking out the source file information is complex, because a lot of source file attributes (like the uncompressed size) are recorded in the zip file header.
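A minimal sketch of that chunked-compression pattern, using zlib.compressobj with the raw-DEFLATE parameters the zip format stores (wbits=-15 suppresses the zlib header and trailer):

```python
import zlib

def deflate_chunks(chunks):
    # Incrementally compress an iterable of byte chunks through a single
    # zlib.compressobj, the way ZipFile.write does internally.
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:
            yield data
    yield compressor.flush()
```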
Some (many? most?) compression algorithms are based on looking at redundancies across the entire file.
Some compression libraries will choose between several compression algorithms based on which works best on the file.
I believe the ZipFile module does this, so it wants to see the entire file, not just pieces at a time.
Hence, it won't work with generators or files too big to load into memory. That would explain the limitation of the zipfile library.
In case anyone stumbles upon this question, which is still relevant in 2017 for Python 2.7, here's a working solution for a true streaming zip file, with no requirement for the output to be seekable as in the other cases. The secret is to set bit 3 of the general purpose bit flag (see https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT section 4.3.9.1).
Note that this implementation will always create a ZIP64-style file, allowing the streaming to work for arbitrarily large files. It includes an ugly hack to force the zip64 end of central directory record, so be aware it will cause all zipfiles written by your process to become ZIP64-style.
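As an illustration of the bit-3 technique (not the answer's full implementation): with bit 3 set, the writer can emit the local file header before the CRC and sizes are known, and append them afterwards in a data descriptor. A sketch of the ZIP64 descriptor layout, per APPNOTE.TXT section 4.3.9:

```python
import struct

def zip64_data_descriptor(crc32, compressed_size, uncompressed_size):
    # Written after an entry's compressed bytes when bit 3 of the
    # general purpose flag is set.
    return struct.pack(
        '<LLQQ',
        0x08074b50,         # optional data descriptor signature
        crc32,              # CRC-32 of the uncompressed data
        compressed_size,    # 8 bytes in the ZIP64 form
        uncompressed_size,  # 8 bytes in the ZIP64 form
    )
```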
The gzip library will take a file-like object for compression.
You still need to provide a nominal filename for inclusion in the zip file, but you can pass your data-source to the fileobj.
(This answer differs from that of Damnsweet, in that the focus should be on the data-source being incrementally read, not the compressed file being incrementally written.)
And I see now the original questioner won't accept Gzip :-(
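For reference, a minimal sketch of the gzip approach the answer describes; the names here are illustrative. GzipFile's filename argument supplies the nominal name stored in the gzip header, while fileobj is the underlying output the compressed bytes are written to:

```python
import gzip

def gzip_from_generator(chunks, out_path='output.gz'):
    # 'chunks' is any iterable of bytes; data is compressed incrementally
    # instead of being held in memory all at once.
    with open(out_path, 'wb') as raw, \
            gzip.GzipFile(filename='data.txt', mode='wb', fileobj=raw) as gz:
        for chunk in chunks:
            gz.write(chunk)
```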
Now with Python 2.7 you can add the data itself to the zipfile, instead of adding a file:
http://docs.python.org/2/library/zipfile#zipfile.ZipFile.writestr
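A minimal illustration of ZipFile.writestr; the archive and member names are made up:

```python
import zipfile

with zipfile.ZipFile('archive.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('hello.txt', b'some in-memory data')  # data, not a path
```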
This is 2017. If you are still looking to do this elegantly, use Python Zipstream by allanlei.
So far, it is probably the only well-written library to accomplish that.
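A sketch of zipstream's generator-based API; this follows the project's README as best I recall, so verify the exact calls against it:

```python
import zipstream  # pip install zipstream (allanlei's python-zipstream)

def large_data():
    # Hypothetical data source: yields bytes chunk by chunk.
    for _ in range(10000):
        yield b'a chunk that never sits fully in memory\n'

z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
z.write_iter('big.txt', large_data())  # register an iterable as a member

# Iterating the ZipFile yields the archive's bytes incrementally.
with open('out.zip', 'wb') as f:
    for chunk in z:
        f.write(chunk)
```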
You can use stream-zip for this (full disclosure: written mostly by me).
Say you have generators of bytes you want to zip. You can create a single iterable of the zipped bytes of these generators, and then, for example, save this iterable to disk; all three steps appear in the sketch below.
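A sketch following the stream-zip documentation: each member file is a tuple of name, modification time, permissions, ZIP mode, and an iterable of bytes (verify the exact tuple shape against the project's docs):

```python
from datetime import datetime
from stream_zip import ZIP_64, stream_zip

def file_1_data():
    yield b'Some bytes'                     # any generator of bytes works

def member_files():
    # (name, modification time, permissions, ZIP mode, iterable of bytes)
    yield ('my-file-1.txt', datetime.now(), 0o600, ZIP_64, file_1_data())

# A single iterable of the zipped bytes of the generators above...
zipped_chunks = stream_zip(member_files())

# ...saved to disk chunk by chunk.
with open('test.zip', 'wb') as f:
    for chunk in zipped_chunks:
        f.write(chunk)
```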
The zipstream-ng library handles this exact scenario:
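A sketch based on the zipstream-ng README as I recall it; treat the exact API as an assumption to verify:

```python
from zipstream import ZipStream  # pip install zipstream-ng

def generate_data():
    # Hypothetical generator of bytes.
    for _ in range(10000):
        yield b'a chunk of data produced on the fly\n'

zs = ZipStream()
zs.add(generate_data(), 'data.txt')  # add an iterable of bytes as a member

with open('out.zip', 'wb') as f:
    f.writelines(zs)                 # ZipStream is itself iterable
```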