Strange "BadZipfile: Bad CRC-32" problem
This code is a simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:
#!/usr/bin/env python

import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('Could not import the `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'StringIO':
        data = file(filename).read()
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'BytesIO':
        data = file(filename).read()
        return zipfile.ZipFile(io.BytesIO(data))


def process_zip_file(filename, method, open_defaults_file):
    zip_file = get_zip_file(filename, method)
    items_file = zip_file.open('items.csv')
    csv_file = csv.DictReader(items_file)

    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']

            if open_defaults_file:
                z = zip_file.open('defaults.csv')
                z.close()

        sys.stdout.write('Processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('Processing failed on item %d\n\n%s'
                         % (idx, traceback.format_exc()))


process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))
Pretty simple. We open the zip file and one or two CSV files inside the zip file.
What's weird is that if I run this with a large zip file (~13 MB), have it instantiate the ZipFile from a StringIO.StringIO or an io.BytesIO, and have it open TWO csv files rather than just one, then it fails towards the end of processing. (Perhaps this happens with anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile or even from a file object created by calling os.tmpfile() and shutil.copyfileobj().) Here's the output that I see on a Linux system:
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242
Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242
Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.
Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of raising the BadZipfile exception, it seems to read corrupted data and gets very confused.
This all suggests to me that I am doing something in this code that you are not supposed to do, e.g. calling zipfile.open on a file while already having another file within the same zip file object open? This doesn't seem to be a problem when using ZipFile(filename), but perhaps it's problematic when passing ZipFile a file-like object, because of some implementation details in the zipfile module?
Perhaps I missed something in the zipfile docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile module?
Comments (4)
I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile module with a hacked one of my own (called myzipfile here).
The problem in the standard zipfile module is that when passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are called on the same file, so opening multiple files within the zip file causes the file position to be shared, and multiple open calls end up stepping all over each other. In contrast, when passed a filename, open opens a new file object. My solution, for the case when a file object is passed in, is to create a copy of it instead of using that file object directly.
This change to zipfile fixes the problems I was seeing, but I don't know if it has other negative impacts on zipfile...
EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. It is in the documentation for ZipFile.open at http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open.
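One way to get independent file positions without patching zipfile is to give each concurrently open member its own ZipFile over its own in-memory buffer. The following is only a minimal sketch of that idea, written for Python 3 rather than the Python 2.7 used in the question. The helper names build_sample_zip and count_items are hypothetical, and the tiny generated archive is a stand-in for the real upload. It is not the patch described in this answer.

```python
import csv
import io
import zipfile


def build_sample_zip():
    """Build a small in-memory zip (hypothetical stand-in for the upload)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        rows = ["image1"] + ["img%d.png" % i for i in range(250)]
        zf.writestr("items.csv", "\n".join(rows) + "\n")
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    return buf.getvalue()


def count_items(zip_bytes):
    """Iterate items.csv while repeatedly opening defaults.csv."""
    # Each ZipFile wraps its own BytesIO, so each has a private file
    # position: seeks done for one member cannot disturb the other.
    items_zip = zipfile.ZipFile(io.BytesIO(zip_bytes))
    defaults_zip = zipfile.ZipFile(io.BytesIO(zip_bytes))
    count = 0
    try:
        with items_zip.open("items.csv") as raw:
            # csv needs text in Python 3, so wrap the binary member.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                _ = row["image1"]
                # Opened through the second ZipFile, not the first.
                with defaults_zip.open("defaults.csv") as defaults:
                    defaults.read()
                count += 1
    finally:
        items_zip.close()
        defaults_zip.close()
    return count


if __name__ == "__main__":
    print("Processed %d items." % count_items(build_sample_zip()))
```

A simpler alternative, when the members are small enough, is to read items.csv completely with ZipFile.read before opening any other member, so that no two members are ever mid-read at the same time.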
What I did was update setuptools and then re-download; it works now.
https://pypi.python.org/pypi/setuptools/35.0.1
In my case, this solved the problem:
Could it be that you had the file open on your desktop? This has happened to me a few times, and the solution was simply to run the code without having the files open outside of the Python session.