Iterating over pathlib paths with python-docx: zipfile.BadZipFile

Posted on 2025-01-13 18:08:57

My Python skills are a bit rusty because I have mostly been using R lately. I ran into the following problem: I want to recursively iterate over all .docx files in a directory and change some of their core properties with the python-docx package.

For the loop, I first created a list of files with pathlib and glob:

from docx import Document
from docx.shared import Inches
import pathlib

# Reading the stats dir
root_dir = pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files

The output of files looks fine:

[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
 WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]

When I then try to open a document from the list, I get a zip error (full traceback below):

document = Document(files[1])
Traceback (most recent call last):
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-482c5438fa33>", line 1, in <module>
    document = Document(files[1])
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
    self._zipf = ZipFile(pkg_file, 'r')
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

However, running the same line of code without the list works fine. The only difference is the path separator (/ versus \ in the raw string), which I assumed should not matter because the list contains pathlib.Path objects.

document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
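On the separator question: pathlib normalizes separators on Windows, so a raw string with backslashes and a string with forward slashes should compare equal. The sketch below uses PureWindowsPath so that it runs on any operating system; the user name in the paths is a placeholder. Note also that files[1] in the listing above is test2.docx, while the direct call opens test1.docx, so the two calls do not open the same file.

```python
from pathlib import PureWindowsPath

# Hypothetical paths for illustration; PureWindowsPath applies Windows
# path rules without touching the file system, so this runs on any OS.
with_backslashes = PureWindowsPath(r"C:\Users\someone\mre_docx\test1.docx")
with_slashes = PureWindowsPath("C:/Users/someone/mre_docx/test1.docx")

# Both spellings parse into the same path components.
same_path = with_backslashes == with_slashes
print(same_path)
print(with_backslashes.as_posix())
```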

Edit in response to a comment

I created a total of four new Word files for this MRE. I entered text in two of them and left the other two empty. To my surprise, the empty ones are the ones that trigger the error.

for file in files:
    try:
        document = Document(file)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f"The file: {file} appears to be corrupted")

Output:

The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted

Semi-solution for future readers

Add a try/except block around the call to Document("Path/to/file.docx") and print the path of each file for which the call fails. In my case there were only a few such files, which I could easily fix by hand.
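As an alternative to catching the exception, the files can be checked before python-docx ever sees them. A .docx file is a zip archive (the traceback above ends in ZipFile), so a zero-byte file can never be a valid one. The following is a minimal sketch using only the standard library; the function name find_unreadable_docx and its two-list return value are illustrative choices, not part of python-docx.

```python
import zipfile
from pathlib import Path


def find_unreadable_docx(root_dir):
    """Return (empty, not_zip): .docx files under root_dir that cannot be zip archives."""
    empty, not_zip = [], []
    for path in sorted(Path(root_dir).glob("**/*.docx")):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            # Zero bytes: there is no zip structure at all.
            empty.append(path)
        elif not zipfile.is_zipfile(path):
            # Non-empty, but the file does not pass the zip signature check.
            not_zip.append(path)
    return empty, not_zip
```

Calling find_unreadable_docx(root_dir) with the root_dir from the question would list the files to skip or fix before the Document(file) loop runs. Passing the zip check does not guarantee that a file is a valid Word document, only that it is not ruled out at this level.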

Comments (1)

凉栀 2025-01-20 18:08:57

You are not doing anything wrong. You get this error because the documents are empty; if you open those files in Word, type something, and save, the error goes away. According to https://python-docx.readthedocs.io/en/latest/user/documents.html, there are several ways to create or open a Word document.

First, create a new blank document and save it (this overwrites the file at that path):

document = Document()
document.save(files[1])

Second, open an existing document and save it back:

document = Document(files[1])
document.save(files[1])

Also, according to the docs, you can open them from a file-like object:

with open(files[1], 'rb') as f:
    document = Document(f)
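Note that opening through a file object should not avoid the error for an empty file. The traceback in the question shows python-docx handing its argument to ZipFile, and ZipFile rejects a zero-byte stream just as it rejects a zero-byte path. Below is a small standard-library sketch of that behaviour, with io.BytesIO standing in for an opened empty file; the helper name is_readable_zip_stream is hypothetical.

```python
import io
import zipfile


def is_readable_zip_stream(stream):
    """Return True if zipfile can open the binary stream as an archive."""
    try:
        with zipfile.ZipFile(stream, "r"):
            return True
    except zipfile.BadZipFile:
        return False


# An empty in-memory stream stands in for open(path, "rb") on a 0-byte file.
empty_stream = io.BytesIO(b"")

# A minimal valid archive built in memory for comparison.
valid_stream = io.BytesIO()
with zipfile.ZipFile(valid_stream, "w") as zf:
    zf.writestr("word/document.xml", "<doc/>")
valid_stream.seek(0)

print(is_readable_zip_stream(empty_stream))
print(is_readable_zip_stream(valid_stream))
```

Of the three snippets in this answer, only the first one, which overwrites the empty file with a new blank document, can be expected to turn it into a valid .docx.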