The best way to process Word documents

Posted 2024-10-04 02:11:32

Comments (5)

如果没结果 2024-10-11 02:11:32

Take a look at the python-docx library.

窗影残 2024-10-11 02:11:32

So I think you're saying that the structure of the document is encoded in the formatting, and you want to produce XML files that capture that structure, whilst keeping the content in plain text?

If so, you will need to parse the documents, build a data structure that can be processed, and then dump it out as XML.

For parsing, there are a few options. Microsoft has published the specification for its binary .doc format, and you will need to read it if you want to write a parser for that format. With .docx you are a little luckier: the file is a ZIP package whose parts are already XML, so once you unzip it you can use any XML parsing library to read the main document part, then search the resulting tree for the data you are interested in. XML parsers are available for pretty much any language; one easy-to-use option that comes to mind is minidom (xml.dom.minidom) for Python.
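As a rough illustration of the .docx route, here is a minimal sketch that uses only the Python standard library. It assumes the main document part lives at `word/document.xml` and that headings are marked with a `w:pStyle` element, which is the usual layout but not guaranteed for every file. The function names and the in-memory demo package are hypothetical and exist only for this example; the demo package is not a complete .docx that Word would open.

```python
import io
import zipfile
from xml.dom import minidom

# WordprocessingML main namespace used by elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(docx_file):
    """Return a list of (style_id, text) pairs, one per w:p element."""
    with zipfile.ZipFile(docx_file) as package:
        dom = minidom.parseString(package.read("word/document.xml"))

    paragraphs = []
    for p in dom.getElementsByTagNameNS(W_NS, "p"):
        # The paragraph style (for example "Heading1") is optional.
        style_id = None
        for style in p.getElementsByTagNameNS(W_NS, "pStyle"):
            style_id = style.getAttributeNS(W_NS, "val")
            break
        # A paragraph's text is split across runs; join every w:t in it.
        text = "".join(
            t.firstChild.data
            for t in p.getElementsByTagNameNS(W_NS, "t")
            if t.firstChild is not None
        )
        paragraphs.append((style_id, text))
    return paragraphs


def build_demo_package():
    """Build a tiny in-memory ZIP holding only word/document.xml (demo only)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Intro</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_paragraphs(build_demo_package()))
    # [('Heading1', 'Intro'), (None, 'Hello world')]
```

Because `getElementsByTagNameNS` searches recursively, paragraphs nested inside text boxes would also be picked up as part of their enclosing paragraph; a real converter would need to decide how to treat those.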

For generating your output XML, a library that maps an object representation to XML again seems to be the way to go; minidom, for example, does that too.
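A small sketch of building the output with minidom might look like the following. The input shape (a list of `(style_id, text)` pairs) and the output element names `document`, `heading`, and `para` are assumptions made up for this example, not something the question specifies.

```python
from xml.dom import minidom


def paragraphs_to_xml(paragraphs):
    """Serialize (style_id, text) pairs into a simple, hypothetical XML layout."""
    doc = minidom.Document()
    root = doc.createElement("document")
    doc.appendChild(root)

    for style_id, text in paragraphs:
        # Assumption: any style whose id starts with "Heading" marks a heading.
        is_heading = bool(style_id) and style_id.startswith("Heading")
        element = doc.createElement("heading" if is_heading else "para")
        # createTextNode stores plain text; escaping happens on serialization.
        element.appendChild(doc.createTextNode(text))
        root.appendChild(element)

    return doc.toprettyxml(indent="  ")


if __name__ == "__main__":
    print(paragraphs_to_xml([("Heading1", "Intro"), (None, "A & B")]))
```

Building the tree through the DOM API rather than by string concatenation means characters such as `&` and `<` in the document text are escaped for you.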

If you don't want to write your own .doc parser, you could first run the documents through a converter that produces a more accessible format: use Word itself to convert the .doc files to .docx, use a tool that produces RDFs from .docs, or use an existing Word parser such as the one in OpenOffice.
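One way to script the conversion step, sketched under assumptions: this uses LibreOffice (the OpenOffice descendant) in headless mode rather than Word or OpenOffice itself, and it assumes a `soffice` executable is on the PATH. The helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to .docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_docx(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the expected output path."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The resulting .docx files can then be handled with ordinary ZIP and XML tooling instead of a binary .doc parser.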

紫轩蝶泪 2024-10-11 02:11:32

I used a very inefficient conditional search in VBA to literally copy the document into a second document. I then saved the second document with a .xml extension. It got the job done, but it's ugly.

烟雨扶苏 2024-10-11 02:11:32

You can also try the Java-based Apache POI HWPF. It supports text extraction. You will then have to create your own XML document; Castor XML or XStream can help with that.

幸福丶如此 2024-10-11 02:11:32

It really depends on exactly what you are trying to do.

The simplest approach would be to save the document as Flat OPC XML (in Word, "Save As..." XML), and then apply an XSLT.

This approach is simplest because it gives you the entire docx as a single XML file, so you don't have to unzip it first.
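To show what "a single XML file" means in practice, here is a hedged sketch that reads a Flat OPC document with Python's ElementTree instead of XSLT (the standard library has no XSLT processor, so this is a substitute for illustration, not the approach the answer recommends). It assumes the main document is the `pkg:part` named `/word/document.xml`; the function name and sample string are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces for the Flat OPC wrapper and for WordprocessingML content.
PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def flat_opc_paragraph_texts(flat_opc_xml):
    """Return the plain text of each paragraph in the main document part."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter(f"{{{PKG_NS}}}part"):
        if part.get(f"{{{PKG_NS}}}name") == "/word/document.xml":
            return [
                "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                for p in part.iter(f"{{{W_NS}}}p")
            ]
    raise ValueError("no /word/document.xml part found")


SAMPLE = (
    f'<pkg:package xmlns:pkg="{PKG_NS}">'
    '<pkg:part pkg:name="/word/document.xml"><pkg:xmlData>'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>First</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>line</w:t></w:r></w:p>'
    '</w:body></w:document>'
    '</pkg:xmlData></pkg:part></pkg:package>'
)

if __name__ == "__main__":
    print(flat_opc_paragraph_texts(SAMPLE))
```

For a file saved from Word, `ET.parse(path).getroot()` would replace the `ET.fromstring` call; the rest of the traversal stays the same.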

If your requirements are more complex, for example, analyzing the formatting or styles, or doing something with hyperlinks, then an object model such as docx4j (Java) or Open XML SDK (C#) - and no doubt there are others - may help.
