How can I retrieve the author of an Office file in Python?
The title explains the problem: I have doc and docx files whose author information I want to retrieve so that I can reorganize my files.
os.stat returns only the size and timestamps, that is, file-system-level information. open(filename, 'rb').read(200) returns many characters that I could not parse.
There is a module called xlrd for reading xlsx files, but it still doesn't let me read doc or docx files. I am aware that the new Office file formats are not easily read by non-MS Office programs, so if that's impossible, gathering the information from old Office files would suffice.
Comments (5)
Since docx files are just zipped XML, you can unzip the docx file and pull the author information out of one of the XML files inside. I'm not quite sure where it is stored, but from looking around briefly I suspect it is stored as dc:creator in docProps/core.xml. You can open the docx file as a zip archive and read the creator from that file.
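The code that originally accompanied this answer did not survive, so what follows is only a minimal sketch of the approach it describes, using the standard library's zipfile and xml.etree.ElementTree modules. The function name get_docx_author is made up for illustration, and the sketch assumes the author is stored in the dc:creator element of docProps/core.xml, as the answer suspects.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefix used inside docProps/core.xml for Dublin Core elements.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def get_docx_author(path):
    """Return the dc:creator text of a docx file, or None if it is missing.

    Hypothetical helper: the name and the None fallback are illustrative choices.
    """
    with zipfile.ZipFile(path) as archive:
        try:
            xml_bytes = archive.read("docProps/core.xml")
        except KeyError:
            # The archive has no core-properties part.
            return None
    root = ET.fromstring(xml_bytes)
    creator = root.find("dc:creator", NS)
    return creator.text if creator is not None else None
```

Calling get_docx_author("report.docx") (the file name is a placeholder) should return the creator string if the file carries one. Old binary .doc files are not zip archives, so zipfile.ZipFile raises zipfile.BadZipFile for them.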
You can use COM interop to access the Word object model. This link talks about the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
The secret when working with any of the Office objects is knowing which item to access among the overwhelming number of methods and properties. In this case each document has a collection called BuiltInDocumentProperties. The property of interest is "Last Author".
After you open the document, you can access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Last Author")
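A rough sketch of how that call could be wrapped is shown below. It assumes Windows with Microsoft Word and the third-party pywin32 package installed; the helper names open_word and read_builtin_property are invented for illustration, and the sketch has not been verified against every Word version. Note that in Word's built-in properties "Last Author" is the person who last saved the document, while "Author" is the original author, so the property name is left as a parameter.

```python
def open_word():
    """Start Word through COM. Requires Windows, Word, and pywin32."""
    import win32com.client  # third-party (pywin32); imported lazily on purpose

    return win32com.client.Dispatch("Word.Application")


def read_builtin_property(word, path, name="Last Author"):
    """Open a document read-only and return one built-in property value.

    `word` is the object returned by open_word(); `path` should be an
    absolute path, since Word resolves relative paths against its own
    working directory.
    """
    doc = word.Documents.Open(path, ReadOnly=True)
    try:
        return doc.BuiltInDocumentProperties(name).Value
    finally:
        doc.Close(False)  # close the document without saving changes
```

A typical session would call word = open_word(), then read_builtin_property(word, r"C:\docs\report.doc", "Author") for each file (the path is a placeholder), and finally word.Quit().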
How about using the docx library? With it you can pull more information about the file than just the author. More details are in the docx library's documentation and on its GitHub account.
For old Office documents (.doc, .xls) you can use hachoir-metadata.
It does not work well with the new file formats: for example, it can parse .xlsx files, but it will not give you an author name.
The newer Office formats are just zip containers holding XML files. Have a look at https://github.com/profHajal/Microsoft-Office-Documents-Metadata-with-Python/blob/main/mso_md.py for a very simple, straightforward approach.
The code listed there is easily extended to OpenOffice formats.
Pseudocode: open the file as a zip archive, read the metadata XML file inside it, and search that file for the tags of interest.
Files to search for metadata:
docProps/core.xml (MS Office files)
meta.xml (OpenOffice files)
A non-exhaustive list of tags you can search for.
From the Dublin Core namespace rules (dc):
dc:title
dc:creator
dc:description
dc:subject
dc:date
From the ODF specification (meta):
meta:generator
meta:keyword
meta:creation-date
meta:template (data in attributes)
meta:document-statistic (data in attributes)
MS Office specific:
dcterms:created
dcterms:modified
cp:lastModifiedBy
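The outline above can be sketched with the standard library alone. This is not the code from the linked repository; read_metadata, PREFIXES, and METADATA_PARTS are names chosen here for illustration. The sketch collects only element text, so tags that keep their data in attributes (meta:template, meta:document-statistic) are skipped, and when a tag such as meta:keyword repeats, the last value wins.

```python
import zipfile
import xml.etree.ElementTree as ET

# Map namespace URIs to the prefixes used in the tag list above.
PREFIXES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://purl.org/dc/terms/": "dcterms",
    "http://schemas.openxmlformats.org/package/2006/metadata/core-properties": "cp",
    "urn:oasis:names:tc:opendocument:xmlns:meta:1.0": "meta",
}

# Metadata parts: the first for MS Office files, the second for OpenOffice files.
METADATA_PARTS = ("docProps/core.xml", "meta.xml")


def read_metadata(path):
    """Return {"prefix:tag": text} for known metadata tags found in the file."""
    found = {}
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        for part in METADATA_PARTS:
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            for elem in root.iter():
                if not elem.tag.startswith("{"):
                    continue  # element without a namespace
                uri, local = elem.tag[1:].split("}", 1)
                prefix = PREFIXES.get(uri)
                text = (elem.text or "").strip()
                if prefix and text:
                    found[f"{prefix}:{local}"] = text
    return found
```

For a docx, xlsx, or pptx file the author would then be read_metadata(path).get("dc:creator"), and cp:lastModifiedBy gives the last editor when it is present.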