Python in-memory zip library

Posted on 2024-08-25 22:18:31

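Is there a Python library that allows manipulation of zip archives in memory, without having to use actual disk files?

The ZipFile library does not allow you to update the archive. The only way seems to be to extract it to a directory, make your changes, and create a new zip from that directory. I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them.

Something similar to Java's ZipInputStream/ZipOutputStream would do the trick, although any interface that avoids disk access would be fine.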

恍梦境° 2024-09-01 22:18:31

Python 3

import io
import zipfile

zip_buffer = io.BytesIO()

with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')),
                            ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

with open('C:/1.zip', 'wb') as f:
    f.write(zip_buffer.getvalue())

贱贱哒 2024-09-01 22:18:31

According to the Python docs:

class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])

  Open a ZIP file, where file can be either a path to a file (a string) or a file-like object.

So, to open the file in memory, just create a file-like object (perhaps using BytesIO).

import io
import zipfile

file_like_object = io.BytesIO(my_zip_data)
zipfile_ob = zipfile.ZipFile(file_like_object)
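
Since the question is about changing an archive rather than only opening it, here is a minimal sketch of one way to do that with the standard library alone: read the members from one BytesIO and write a new archive into another. The function name rewrite_zip and its replacements parameter are made up for this illustration, and the sample archive stands in for downloaded bytes.

```python
import io
import zipfile


def rewrite_zip(zip_bytes, replacements):
    """Return new ZIP bytes with selected members replaced or added.

    zip_bytes: the original archive as bytes (for example a downloaded body).
    replacements: dict mapping member name -> new content as bytes.
    """
    source = io.BytesIO(zip_bytes)
    target = io.BytesIO()
    with zipfile.ZipFile(source) as zin, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.filename in replacements:
                continue  # written below with the new content
            # Copy the member unchanged, reusing its metadata
            zout.writestr(info, zin.read(info.filename))
        for name, content in replacements.items():
            zout.writestr(name, content)
    # Both archives are closed here, so the result is a complete ZIP file
    return target.getvalue()


# Build a small archive in memory to stand in for a downloaded one
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as zf:
    zf.writestr("keep.txt", b"unchanged")
    zf.writestr("edit.txt", b"old")

updated = rewrite_zip(original.getvalue(),
                      {"edit.txt": b"new", "added.txt": b"extra"})

with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    names = sorted(zf.namelist())
    edited = zf.read("edit.txt")
    kept = zf.read("keep.txt")
```

This holds both the old and the new archive in memory at once, which is usually fine for small downloads but may not suit very large archives.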

极度宠爱 2024-09-01 22:18:31

From the article In-Memory Zip in Python:

Below is a post of mine from May of 2008 on zipping in memory with Python, re-posted since Posterous is shutting down.
I recently noticed that there is a for-pay component available to zip files in-memory with Python. Considering this is something that should be free, I threw together the following code. It has only gone through very basic testing, so if anyone finds any errors, let me know and I’ll update this.

import zipfile
import StringIO

class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of 
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)

        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)

        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0        

        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()

if __name__ == "__main__":
    # Run a test
    imz = InMemoryZip()
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
    imz.writetofile("test.zip")

堇色安年 2024-09-01 22:18:31

The example Ethier provided has several problems, some of them major:

It doesn't work for real data on Windows. A ZIP file is binary, and its data should always be written to a file opened with 'wb'.
The ZIP file is reopened in append mode for each file, which is inefficient. It can be opened once and kept as an InMemoryZip attribute.
The documentation states that ZIP files should be closed explicitly, which the append function does not do (it probably works for the example because zf goes out of scope, and that closes the ZIP file).
The create_system flag is set for all the files in the ZIP file every time a file is appended, instead of just once per file.
On Python < 3, cStringIO is much more efficient than StringIO.
It doesn't work on Python 3 (the original article predates the 3.0 release, but by the time the code was posted 3.1 had been out for a long time).
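
The points in this list can be addressed with the standard library alone. What follows is a minimal Python 3 sketch, not the ruamel.std.zipfile implementation: the class keeps one open ZipFile as an attribute, sets create_system once per entry, closes the archive explicitly, and writes in binary mode. The data property and the close method are names chosen for this illustration.

```python
import io
import zipfile


class InMemoryZip:
    """Python 3 take on the InMemoryZip idea, standard library only."""

    def __init__(self):
        self._buffer = io.BytesIO()
        # Open the archive once and keep it as an attribute
        self._zip = zipfile.ZipFile(self._buffer, "w", zipfile.ZIP_DEFLATED)

    def append(self, filename_in_zip, file_contents):
        """Add one entry; file_contents may be str or bytes."""
        info = zipfile.ZipInfo(filename_in_zip)
        info.compress_type = zipfile.ZIP_DEFLATED
        info.create_system = 0  # set once, for this entry only
        self._zip.writestr(info, file_contents)
        return self

    def close(self):
        """Write the central directory; safe to call more than once."""
        self._zip.close()

    @property
    def data(self):
        """Return the finished archive as bytes."""
        self.close()
        return self._buffer.getvalue()

    def writetofile(self, filename):
        # ZIP data is binary, so the file is opened with "wb"
        with open(filename, "wb") as f:
            f.write(self.data)


imz = InMemoryZip()
imz.append("test.txt", "Another test").append("test2.txt", b"Still another")
archive_bytes = imz.data

with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
    listed = zf.namelist()
    first = zf.read("test.txt")
```

Once data has been read the archive is closed, so further append calls would fail; a fuller version might reopen the buffer in append mode.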

An updated version is available if you install ruamel.std.zipfile (of which I am the author). After

pip install ruamel.std.zipfile

or including the code for the class from here, you can do:

import ruamel.std.zipfile as zipfile

# Run a test
imz = zipfile.InMemoryZipFile()
imz.append("test.txt", "Another test").append("test2.txt", "Still another")
imz.writetofile("test.zip")

You can alternatively write the contents using imz.data to any place you need.

You can also use the with statement, and if you provide a filename, the contents of the ZIP will be written on leaving that context:

with zipfile.InMemoryZipFile('test.zip') as imz:
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")

Because writing to disk is delayed, you can actually read from an old test.zip within that context.

安静被遗忘 2024-09-01 22:18:31

I am using Flask to create an in-memory zipfile and return it as a download. Builds on the example above from Vladimir. The seek(0) took a while to figure out.

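import io
import zipfile

from flask import Flask, send_file

app = Flask(__name__)

@app.route('/download')  # illustrative route, not part of the original snippet
def download_zip():
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, data in [('1.txt', io.BytesIO(b'111')), ('2.txt', io.BytesIO(b'222'))]:
            zip_file.writestr(file_name, data.getvalue())

    # Rewind so that send_file reads from the start of the buffer
    zip_buffer.seek(0)
    # Flask 2.0 and later use download_name; older versions used attachment_filename
    return send_file(zip_buffer, download_name='filename.zip', as_attachment=True)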
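
Why the seek(0) matters can be seen without Flask. This small sketch, which uses only the standard library, reads the buffer once straight after writing and once after rewinding; the variable names are illustrative.

```python
import io
import zipfile

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
    zip_file.writestr("1.txt", b"111")

# After writing, the buffer position sits at the end of the archive,
# so a plain read() has nothing left to return
without_seek = zip_buffer.read()

# Rewinding makes the whole archive available to read()
zip_buffer.seek(0)
with_seek = zip_buffer.read()
```

Anything that consumes the buffer through read(), as a file-sending helper typically does, needs the rewind. getvalue() ignores the position and returns everything either way.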

青瓷清茶倾城歌 2024-09-01 22:18:31

Helper to create an in-memory zip file with multiple files based on data like {'1.txt': 'string', '2.txt': b'bytes'}

import io, zipfile

def prepare_zip_file_content(file_name_content: dict) -> bytes:
    """Returns ZIP bytes ready to be saved with
    open('C:/1.zip', 'wb') as f: f.write(bytes)
    @file_name_content dict like {'1.txt': 'string', '2.txt': b'bytes'}
    """
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, file_data in file_name_content.items():
            zip_file.writestr(file_name, file_data)

    zip_buffer.seek(0)
    return zip_buffer.getvalue()

⊕婉儿 2024-09-01 22:18:31

I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them

This is possible using the two libraries https://github.com/uktrade/stream-unzip and https://github.com/uktrade/stream-zip (full disclosure: written by me). And depending on the changes, you might not even have to store the entire zip in memory at once.

Say you just want to download, unzip, zip, and re-upload. Slightly pointless, but you could slot in some changes to the unzipped content:

from datetime import datetime
import httpx
from stream_unzip import stream_unzip
from stream_zip import stream_zip, ZIP_64

def get_source_bytes_iter(url):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()

def get_target_files(files):
    # stream-unzip doesn't expose perms or modified_at, but stream-zip requires them
    modified_at = datetime.now()
    perms = 0o600

    for name, _, chunks in files:
        # Could change name, manipulate chunks, skip a file, or yield a new file
        yield name.decode(), modified_at, perms, ZIP_64, chunks

source_url = 'https://source.test/file.zip'
target_url = 'https://target.test/file.zip'

source_bytes_iter = get_source_bytes_iter(source_url)
source_files = stream_unzip(source_bytes_iter)
target_files = get_target_files(source_files)
target_bytes_iter = stream_zip(target_files)

# httpx takes a raw byte iterator through content=; data= is meant for form fields
httpx.put(target_url, content=target_bytes_iter)

情徒 2024-09-01 22:18:31

You can use the library libarchive in Python through ctypes - it offers ways of manipulating ZIP data in memory, with a focus on streaming (at least historically).

Say we want to uncompress ZIP files on the fly while downloading from an HTTP server. The below code

from contextlib import contextmanager
from ctypes import CFUNCTYPE, POINTER, create_string_buffer, cdll, byref, c_ssize_t, c_char_p, c_int, c_void_p, c_char
from ctypes.util import find_library

import httpx

def get_zipped_chunks(url, chunk_size=65536):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes(chunk_size=chunk_size)

def stream_unzip(zipped_chunks, chunk_size=65536):
    # Library
    libarchive = cdll.LoadLibrary(find_library('archive'))

    # Callback types
    open_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)
    read_callback_type = CFUNCTYPE(c_ssize_t, c_void_p, c_void_p, POINTER(POINTER(c_char)))
    close_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)

    # Function types
    libarchive.archive_read_new.restype = c_void_p
    libarchive.archive_read_open.argtypes = [c_void_p, c_void_p, open_callback_type, read_callback_type, close_callback_type]
    libarchive.archive_read_finish.argtypes = [c_void_p]

    libarchive.archive_entry_new.restype = c_void_p

    libarchive.archive_read_next_header.argtypes = [c_void_p, c_void_p]
    libarchive.archive_read_support_compression_all.argtypes = [c_void_p]
    libarchive.archive_read_support_format_all.argtypes = [c_void_p]

    libarchive.archive_entry_pathname.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.restype = c_char_p

    libarchive.archive_read_data.argtypes = [c_void_p, POINTER(c_char), c_ssize_t]
    libarchive.archive_read_data.restype = c_ssize_t

    libarchive.archive_error_string.argtypes = [c_void_p]
    libarchive.archive_error_string.restype = c_char_p

    ARCHIVE_EOF = 1
    ARCHIVE_OK = 0

    it = iter(zipped_chunks)
    compressed_bytes = None  # Make sure not garbage collected

    @contextmanager
    def get_archive():
        archive = libarchive.archive_read_new()
        if not archive:
            raise Exception('Unable to allocate archive')

        try:
            yield archive
        finally:
            libarchive.archive_read_finish(archive)

    def read_callback(archive, client_data, buffer):
        nonlocal compressed_bytes

        try:
            compressed_bytes = create_string_buffer(next(it))
        except StopIteration:
            return 0
        else:
            buffer[0] = compressed_bytes
            return len(compressed_bytes) - 1

    def uncompressed_chunks(archive):
        uncompressed_bytes = create_string_buffer(chunk_size)
        while (num := libarchive.archive_read_data(archive, uncompressed_bytes, len(uncompressed_bytes))) > 0:
            # .raw keeps NUL bytes; .value would stop at the first one
            yield uncompressed_bytes.raw[:num]
        if num < 0:
            raise Exception(libarchive.archive_error_string(archive))

    with get_archive() as archive: 
        libarchive.archive_read_support_compression_all(archive)
        libarchive.archive_read_support_format_all(archive)

        libarchive.archive_read_open(
            archive, 0,
            open_callback_type(0), read_callback_type(read_callback), close_callback_type(0),
        )
        entry = c_void_p(libarchive.archive_entry_new())
        if not entry:
            raise Exception('Unable to allocate entry')

        while (status := libarchive.archive_read_next_header(archive, byref(entry))) == ARCHIVE_OK:
            yield (libarchive.archive_entry_pathname(entry), uncompressed_chunks(archive))

        if status != ARCHIVE_EOF:
            raise Exception(libarchive.archive_error_string(archive))

can be used as follows to do that

zipped_chunks = get_zipped_chunks('https://domain.test/file.zip')
files = stream_unzip(zipped_chunks)

for name, uncompressed_chunks in files:
    print(name)
    for uncompressed_chunk in uncompressed_chunks:
        print(uncompressed_chunk)

In fact since libarchive supports multiple archive formats, and nothing above is particularly ZIP-specific, it may well work with other formats.

鲸落 2024-09-01 22:18:31

It's important to note that if you want to use the newly created in-memory ZIP archive outside of Python, such as saving it to a local disk or sending it in a POST request, it needs to have the end of central directory record written to it; otherwise, it won't be recognized as a valid ZIP file.

This would look like (for Python 3.11)

import io
import zipfile

import requests

addr = "https://example.test/upload"  # placeholder upload URL
files_to_add = [("example_dir/example_file.txt", b"example contents")]  # placeholder data

with (
    io.BytesIO() as raw,
    zipfile.ZipFile(raw, "a", zipfile.ZIP_DEFLATED, False) as zip_file
):
    for file_name, file_data in files_to_add:
        zip_file.writestr(file_name, file_data)

    zip_file.close()  # THIS is REQUIRED! It writes the end of central directory record

    requests.post(addr, files={"file": ("zip_name.zip", raw.getvalue())})
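
The effect of the missing end of central directory record can be checked locally. This short sketch, using only the standard library, takes a snapshot of the buffer before and after close() and asks zipfile.is_zipfile whether each looks like a ZIP archive; the variable names are illustrative.

```python
import io
import zipfile

raw = io.BytesIO()
zip_file = zipfile.ZipFile(raw, "w", zipfile.ZIP_STORED)
zip_file.writestr("example_dir/example_file.txt", b"example contents")

# Snapshot taken before close(): the entry data is present,
# but the central directory has not been written yet
valid_before_close = zipfile.is_zipfile(io.BytesIO(raw.getvalue()))

zip_file.close()

# Snapshot taken after close(): the archive is now complete
valid_after_close = zipfile.is_zipfile(io.BytesIO(raw.getvalue()))
```

Opening the ZipFile in a with statement and reading the buffer only after the block has ended gives the same guarantee without the explicit close() call.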
