Strange "BadZipfile: Bad CRC-32" problem

Posted on 2024-10-31 21:47:25

This code is a simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:

#!/usr/bin/env python

import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('Could not import the `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'StringIO':
        data = open(filename, 'rb').read()  # binary mode: zip data is bytes
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'BytesIO':
        data = open(filename, 'rb').read()  # binary mode: zip data is bytes
        return zipfile.ZipFile(io.BytesIO(data))


def process_zip_file(filename, method, open_defaults_file):
    zip_file    = get_zip_file(filename, method)
    items_file  = zip_file.open('items.csv')
    csv_file    = csv.DictReader(items_file)

    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']

            if open_defaults_file:
                z = zip_file.open('defaults.csv')
                z.close()

        sys.stdout.write('Processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('Processing failed on item %d\n\n%s' 
                         % (idx, traceback.format_exc()))


process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))

Pretty simple. We open the zip file and one or two CSV files inside the zip file.

What's weird is that it fails towards the end of processing if I run it with a large zip file (~13 MB), have it instantiate the ZipFile from a StringIO.StringIO or an io.BytesIO, and have it open TWO csv files rather than just one. (Perhaps this applies to anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile, or even from a file object created by calling os.tmpfile() and shutil.copyfileobj().) Here's the output that I see on a Linux system:

$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.

$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.

Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of the BadZipfile exception, it seems to read corrupted data and gets very confused.

This all suggests to me that I am doing something in this code that you are not supposed to do. For example, is it wrong to call ZipFile.open for one member while another member of the same ZipFile object is still open? This doesn't seem to be a problem when using ZipFile(filename), but perhaps it's problematic when passing ZipFile a file-like object, because of some implementation details in the zipfile module?

Perhaps I missed something in the zipfile docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile module?

勿忘心安 2024-11-07 21:47:25

I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile module with a hacked one of my own (called myzipfile here).

$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py        2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
 import binascii, cStringIO, stat
 import io
 import re
+import copy

 try:
     import zlib # We may need its compression method
@@ -877,7 +878,7 @@
         # Only open a new file for instances where we were not
         # given a file object in the constructor
         if self._filePassed:
-            zef_file = self.fp
+            zef_file = copy.copy(self.fp)
         else:
             zef_file = open(self.filename, 'rb')

The problem in the standard zipfile module is that when passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are called on the same underlying file, so all members opened from the zip file share one file position, and a second open call moves that position out from under the member that is still being read. In contrast, when passed a filename, open opens a new file object for each call. My solution covers the case where a file object is passed in: instead of using that file object directly, I create a copy of it.
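To make the shared-position effect concrete, here is a minimal sketch that does not use zipfile at all. It assumes Python 3, and the names ITEMS, DEFAULTS, ARCHIVE and read_items_in_chunks are made up for illustration. Two "members" sit back to back in one io.BytesIO; a chunked reader computes a CRC-32 over the first member while, optionally, something else seeks and reads on the same object between chunks, roughly as a second open() call would.

```python
import io
import zlib

# Hypothetical stand-in for an archive: two "members" stored back to back.
ITEMS = b"id,image1\n1,a.png\n2,b.png\n"
DEFAULTS = b"key,value\ncolor,red\n"
ARCHIVE = ITEMS + DEFAULTS


def read_items_in_chunks(fp, touch_other_member):
    """Read len(ITEMS) bytes from fp in small chunks and return their CRC-32.

    If touch_other_member is true, a second "reader" seeks and reads on the
    same fp between chunks, moving the shared file position.
    """
    fp.seek(0)
    remaining = len(ITEMS)
    crc = 0
    while remaining:
        chunk = fp.read(min(8, remaining))
        crc = zlib.crc32(chunk, crc)
        remaining -= len(chunk)
        if touch_other_member:
            # The other reader jumps to the second member and reads a little.
            fp.seek(len(ITEMS))
            fp.read(4)
    return crc & 0xFFFFFFFF


expected = zlib.crc32(ITEMS) & 0xFFFFFFFF
shared = io.BytesIO(ARCHIVE)
undisturbed_ok = read_items_in_chunks(shared, False) == expected
disturbed_ok = read_items_in_chunks(shared, True) == expected
print(undisturbed_ok, disturbed_ok)
```

In the disturbed run the chunked reader picks up bytes from the second member after the first chunk, so the checksum no longer matches. That is the same kind of mismatch the "Bad CRC-32" error reports.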

This change to zipfile fixes the problems I was seeing:

$ ./test_zip_file.py ~/data.zip StringIO 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip BytesIO 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.

but I don't know if it has other negative impacts on zipfile...

EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open, it says:

Note: If the ZipFile was created by passing in a file-like object as the first argument to the
constructor, then the object returned by open() shares the ZipFile’s file pointer. Under these
circumstances, the object returned by open() should not be used after any additional operations
are performed on the ZipFile object. If the ZipFile was created by passing in a string (the
filename) as the first argument to the constructor, then open() will create a new file
object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.
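Given that note, one way to avoid patching zipfile is to never keep a half-read member around while touching the ZipFile again. The sketch below is only an illustration under some assumptions: it targets Python 3 rather than the Python 2.7 used above, the helper names build_sample_zip and count_items are hypothetical, and the sample archive is generated in memory instead of being the real ~13 MB upload. Each member is read completely with ZipFile.read() and parsed from memory afterwards.

```python
import csv
import io
import zipfile


def build_sample_zip():
    """Build a small in-memory zip; member names mirror the question."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        rows = "".join("img%d.png\n" % i for i in range(250))
        zf.writestr("items.csv", "image1\n" + rows)
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    return buf.getvalue()


def count_items(zip_bytes):
    """Count rows of items.csv while also consulting defaults.csv per row."""
    zip_file = zipfile.ZipFile(io.BytesIO(zip_bytes))
    # Read each member completely before doing anything else with zip_file,
    # so no partially read member depends on the shared file position.
    items_text = zip_file.read("items.csv").decode("utf-8")
    defaults_text = zip_file.read("defaults.csv").decode("utf-8")

    count = 0
    for row in csv.DictReader(io.StringIO(items_text)):
        image_filename = row["image1"]
        defaults = list(csv.DictReader(io.StringIO(defaults_text)))
        count += 1
    return count


print(count_items(build_sample_zip()))
```

The trade-off is memory: every member that is read this way is held fully decompressed, which is fine for small CSV files but may not suit very large members.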
