How do I retrieve the author of an Office file in Python?

Posted on 2024-11-29 03:46:34

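The title says it all: I have doc and docx files, and I want to retrieve their author information so that I can reorganize my files.

os.stat returns only the size, the timestamps and other file-system-level information.
open(filename, 'rb').read(200) returns a lot of characters that I could not parse.

There is a module called xlrd for reading xlsx files. However, it still does not let me read doc or docx files. I know that the new Office files are not easily read by non-MS Office programs, so if that is impossible, gathering the information from old Office files would be enough.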

摇划花蜜的午后 2024-12-06 03:46:34

Since docx files are just zipped XML, you can unzip the docx file and pull the author information out of one of the XML files inside. From a brief look at the contents, it appears to be stored as dc:creator in docProps/core.xml.

Here's how you can open the docx file and retrieve the creator:

import zipfile
import lxml.etree

# open the docx file as a zip archive and parse the XML part we are interested in
with zipfile.ZipFile('my_doc.docx') as zf:
    doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))

# retrieve the creator; dc is the Dublin Core elements namespace
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
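
If installing lxml is not an option, the same lookup can be sketched with the standard library only. The function name docx_creator and the choice to return None for files that are not zip containers (such as old binary .doc files) are assumptions made for this illustration, not part of the answer above.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace that holds dc:creator in docProps/core.xml
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def docx_creator(path):
    """Return the dc:creator text of a zipped Office file, or None if unavailable.

    Hypothetical helper: the name and the None fallback are illustrative choices.
    """
    try:
        with zipfile.ZipFile(path) as zf:
            root = ET.fromstring(zf.read("docProps/core.xml"))
    except (zipfile.BadZipFile, KeyError):
        # Not a zip container (e.g. an old binary .doc), or no core.xml part
        return None
    element = root.find(".//dc:creator", DC_NS)
    return element.text if element is not None else None
```

Calling docx_creator('my_doc.docx') on each file in a directory would give the values needed to sort the files by author; files for which it returns None need one of the other approaches in this thread.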

完美的未来在梦里 2024-12-06 03:46:34

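You can use COM interop to access the Word object model. This link discusses the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/

The secret when working with any of the Office objects is knowing which item to access among the overwhelming number of methods and properties. In this case, each document has a list of BuiltInDocumentProperties. The property of interest is "Last Author", which holds the name of the person who last saved the document; the original author is available as "Author".

Once you have the document open, you access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Last Author")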
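
A rough sketch of that approach with pywin32 might look like the following. It assumes Windows with Microsoft Word and pywin32 installed; the helper names builtin_property and word_authors are made up for this illustration, and treating any exception raised while reading a property as "no value" is an assumption about how COM reports unset properties.

```python
import os


def builtin_property(document, name):
    """Read one entry of a Word document's BuiltInDocumentProperties.

    Returns None when the property cannot be read.
    """
    try:
        return document.BuiltInDocumentProperties(name).Value
    except Exception:
        # Assumption: COM raises for properties that have no value set
        return None


def word_authors(path):
    """Open `path` in Word via COM and return (author, last_author)."""
    # Imported here so the module still loads where pywin32 is missing
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly.
        # Word expects an absolute path.
        document = word.Documents.Open(os.path.abspath(path), False, True)
        try:
            return (
                builtin_property(document, "Author"),
                builtin_property(document, "Last Author"),
            )
        finally:
            document.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

Because Word itself opens the file, this route also handles old binary .doc files, at the cost of starting a Word process for every batch of documents.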

温柔女人霸气范 2024-12-06 03:46:34

How about using the docx library? You can pull more information about the file than just the author.

# install the library first:
# pip install python-docx

import docx

file_name = 'file_path_name.docx'

document = docx.Document(file_name)
core_properties = document.core_properties
print(core_properties.author)
print(core_properties.created)
print(core_properties.last_modified_by)
print(core_properties.last_printed)
print(core_properties.modified)
print(core_properties.revision)
print(core_properties.title)
print(core_properties.category)
print(core_properties.comments)
print(core_properties.identifier)
print(core_properties.keywords)
print(core_properties.language)
print(core_properties.subject)
print(core_properties.version)
print(core_properties.content_status)

More information about the docx library is available in its documentation and in its GitHub repository.

记忆里有你的影子 2024-12-06 03:46:34

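For old Office documents (.doc, .xls), you can use hachoir-metadata.

It does not cope well with the new file formats: for instance, it can parse .xlsx files but will not give you the author name.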

白衬杉格子梦 2024-12-06 03:46:34

The newer Office formats are just zip containers holding XML files. For a very simple, straightforward approach, have a look at https://github.com/profHajal/Microsoft-Office-Documents-Metadata-with-Python/blob/main/mso_md.py.

The code listed there is easily extended to OpenOffice formats.

Example:

import zipfile
import xml.dom.minidom

filename = 'my_doc.docx'  # placeholder path to an Office file
with zipfile.ZipFile(filename, 'r') as z:
    # use 'docProps/core.xml' for MS Office files, 'meta.xml' for OpenOffice files
    data = z.read('docProps/core.xml')
doc = xml.dom.minidom.parseString(data)
tag = 'dc:creator'  # the tag you are interested in
metadata_string = doc.getElementsByTagName(tag)[0].childNodes[0].data

Files to search for metadata:

docProps/core.xml for MS Office files
meta.xml for OpenOffice files

A non-exhaustive list of tags you can search for:

From the Dublin Core namespace: `dc`

Title: dc:title
Creator: dc:creator (in OpenOffice files, the person who last modified the document; in MS Office files, the original author)
Description: dc:description
Subject: dc:subject
Date (last modified): dc:date
Language: ???

From the ODF specification: `meta`

Generator (creating software application): meta:generator
Keywords: meta:keyword
Initial Creator: ???
Creation Date and Time: meta:creation-date
Modification Date and Time: ???
Print Date and Time: ???
Document Template: meta:template (data in attributes)
Document Statistics (word count, page count, etc.): meta:document-statistic (data in attributes)

MS Office specific:

Creation Date and Time: dcterms:created
Date (last modified): dcterms:modified
Creator of most recent modification: cp:lastModifiedBy
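
To look up several of these tags in one pass, for either container type, something along these lines could work. It is only a sketch: the function name read_tags, the namespace table and the rule "try docProps/core.xml first, then meta.xml" are assumptions made for this example, not taken from the linked script.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs for the prefixes used in the tag list
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

# Candidate metadata parts: MS Office first, then OpenOffice
METADATA_PARTS = ("docProps/core.xml", "meta.xml")


def read_tags(path, tags):
    """Map each prefixed tag (e.g. 'dc:creator') to its text, or None if absent."""
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        part = next((p for p in METADATA_PARTS if p in names), None)
        if part is None:
            # Neither metadata part exists in this archive
            return {tag: None for tag in tags}
        root = ET.fromstring(archive.read(part))
    result = {}
    for tag in tags:
        element = root.find(".//" + tag, NAMESPACES)
        result[tag] = element.text if element is not None else None
    return result
```

For example, read_tags('report.docx', ['dc:creator', 'cp:lastModifiedBy']) returns a dictionary keyed by tag name. Tags whose data lives in attributes, such as meta:document-statistic, come back with no useful text and would need the element's attributes read instead.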
