How to retrieve the author of an Office file in Python?

Posted on 2024-11-29 03:46:34

The title explains the problem: I have doc and docx files whose author information I want to retrieve so that I can reorganize my files.

os.stat returns only the size, the timestamps, and other file-system-level information.
open(filename, 'rb').read(200) returns many characters that I could not parse.

There is a module called xlrd for reading xlsx files, but it doesn't let me read doc or docx files. I am aware that the new Office formats are not easily read by non-MS Office programs, so if that's impossible, gathering info from old Office files would suffice.

Comments (5)

摇划花蜜的午后 2024-12-06 03:46:34

Since docx files are just zipped XML, you can unzip the docx file and pull the author information out of one of the XML files. From a brief look at the contents, it appears to be stored as dc:creator in docProps/core.xml.

Here's how you can open the docx file and retrieve the creator:

import zipfile
import lxml.etree

# open the docx file as a zip archive and read the core properties part
with zipfile.ZipFile('my_doc.docx') as zf:
    core_xml = zf.read('docProps/core.xml')
# use lxml to parse the XML part we are interested in
doc = lxml.etree.fromstring(core_xml)
# retrieve the creator via the Dublin Core namespace
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
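
The same lookup can probably be done with the standard library alone. The following is a minimal sketch, assuming the file is a valid zip package; the helper name docx_creator is hypothetical (not part of any library), and it returns None instead of raising when the part or the element is missing.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used by dc:creator
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}


def docx_creator(path_or_file):
    """Hypothetical helper: return the dc:creator text, or None if absent."""
    with zipfile.ZipFile(path_or_file) as zf:
        try:
            core_xml = zf.read('docProps/core.xml')
        except KeyError:
            # the archive has no core properties part
            return None
    root = ET.fromstring(core_xml)
    node = root.find('.//dc:creator', DC_NS)
    return node.text if node is not None else None
```

Calling docx_creator('my_doc.docx') would then give the creator string, or None for a package without that property.
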
完美的未来在梦里 2024-12-06 03:46:34

You can use COM interop to access the Word object model. This link talks about the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/

The secret when working with any of the Office objects is knowing which item to access among the overwhelming number of methods and properties. In this case each document has a BuiltInDocumentProperties collection. The property of interest is "Last Author", which holds the name of the last person to save the document; the original author is in the "Author" property.

After you open the document, you can access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Last Author").

温柔女人霸气范 2024-12-06 03:46:34

How about using the python-docx library? It gives you more information about the file than just the author.

# pip install python-docx

import docx

file_name = 'file_path_name.docx'

# load the document and read its core properties
document = docx.Document(file_name)
core_properties = document.core_properties
print(core_properties.author)
print(core_properties.created)
print(core_properties.last_modified_by)
print(core_properties.last_printed)
print(core_properties.modified)
print(core_properties.revision)
print(core_properties.title)
print(core_properties.category)
print(core_properties.comments)
print(core_properties.identifier)
print(core_properties.keywords)
print(core_properties.language)
print(core_properties.subject)
print(core_properties.version)
print(core_properties.content_status)

More information about the docx library is available in its documentation and in its GitHub repository.

记忆里有你的影子 2024-12-06 03:46:34

For old office documents (.doc, .xls) you can use hachoir-metadata.

It does not work well with the new file formats: for example, it can parse .xlsx files, but will not provide you with an Author name.

白衬杉格子梦 2024-12-06 03:46:34

The newer Office formats are just zip containers holding XML files. See https://github.com/profHajal/Microsoft-Office-Documents-Metadata-with-Python/blob/main/mso_md.py for a simple, straightforward approach.

The code listed is easily extendable for OpenOffice formats.

Pseudocode:

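import zipfile
import xml.dom.minidom

filename = 'my_doc.docx'  # placeholder path
with zipfile.ZipFile(filename, 'r') as z:
    # use 'docProps/core.xml' for MS Office files, 'meta.xml' for OpenOffice files
    data = z.read('docProps/core.xml')
doc = xml.dom.minidom.parseString(data)
tag = 'dc:creator'  # the tag you're interested in
metadata_string = doc.getElementsByTagName(tag)[0].childNodes[0].data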

Files to search metadata in:

  • docProps/core.xml for MS Office files
  • meta.xml for OpenOffice files
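
Choosing between the two files can be automated by looking at the archive's member list. This is a rough sketch, not taken from the linked script; the helper name find_metadata_part and the METADATA_PARTS constant are hypothetical. It also returns None for files that are not zip containers, such as legacy binary .doc files.

```python
import zipfile

# Candidate metadata parts, in the order they are checked.
METADATA_PARTS = ('docProps/core.xml', 'meta.xml')


def find_metadata_part(path_or_file):
    """Hypothetical helper: name of the first metadata part present, or None."""
    if not zipfile.is_zipfile(path_or_file):
        # not a zip container at all (for example an old binary .doc)
        return None
    with zipfile.ZipFile(path_or_file) as zf:
        names = set(zf.namelist())
    for part in METADATA_PARTS:
        if part in names:
            return part
    return None
```

The returned name can be passed straight to ZipFile.read() to obtain the XML to parse.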

A non-exhaustive list of tags you can search for:

From the Dublin core namespace rules: dc

  • Title: dc:title
  • Creator: dc:creator (in OpenOffice meta.xml this is the creator of the most recent modification; in MS Office core.xml it is the original author)
  • Description: dc:description
  • Subject: dc:subject
  • Date (last modified): dc:date
  • Language: ???

From the ODF specification: meta

  • Generator (creating software application): meta:generator
  • Keywords: meta:keyword
  • Initial Creator: ???
  • Creation Date and Time: meta:creation-date
  • Modification Date and Time: ???
  • Print Date and Time: ???
  • Document Template: meta:template (data in attributes)
  • Document Statistics (word count, page count, etc.): meta:document-statistic (data in attributes)

MS Office specific:

  • Creation Date and Time: dcterms:created
  • Date (last modified): dcterms:modified
  • Creator of most recent modification: cp:lastModifiedBy
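
Several of the tags above can be collected in one pass with namespace-aware lookups. The following is a minimal sketch for MS Office files only, assuming the usual namespace URIs for the dc, dcterms and cp prefixes in docProps/core.xml; the function name read_core_properties and the FIELDS tuple are hypothetical, and tags that are missing or empty are simply left out of the result.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the prefixes used in docProps/core.xml
NS = {
    'dc': 'http://purl.org/dc/elements/1.1/',
    'dcterms': 'http://purl.org/dc/terms/',
    'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
}

# A subset of the tags listed above
FIELDS = ('dc:title', 'dc:creator', 'dc:subject', 'dc:description',
          'dcterms:created', 'dcterms:modified', 'cp:lastModifiedBy')


def read_core_properties(path_or_file):
    """Hypothetical helper: map each tag that is present to its text."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read('docProps/core.xml'))
    props = {}
    for tag in FIELDS:
        node = root.find('.//' + tag, NS)
        if node is not None and node.text:
            props[tag] = node.text
    return props
```

For reorganizing files by author, the 'dc:creator' and 'cp:lastModifiedBy' entries of the returned dictionary are likely the most useful.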