Converting docx files to custom XML
I have been trying to convert my docx files to a custom XML format that I designed. My users want their data in this format so that it is easier to query from their web app, and they want the input to come from their docx files.
I have looked for a converter API in Java, but none seems to fit my requirement. I looked into docx4j, but it appeared to convert only to HTML and PDF. Is there a converter API to which I can supply, say, an intermediate translator (XSLT), so that the output is my custom XML complete with the data from my docx?
Is there an existing tool for this? If not, any suggestions on the approach to take in coding my own converter? For example, starting from OpenXML, should I convert to XSL-FO first and then to the custom XML?
Would love to hear from the community.
Thank you very much.
Comments (3)
docx4j can be used to convert OpenXML to arbitrary XML via XSLT.
Assuming a Templates xslt and a javax.xml.transform.stream.StreamResult result, you would apply the stylesheet to the document's XML and write the output to result.
However, if all you want to do is transform to XML, then docx4j (and Apache POI, for that matter) is overkill. You could just use OpenXML4J directly.
Whether conversion via XSLT is the best approach, though, depends on whether your target XML is document-oriented or data-oriented.
If it is document-oriented, XSLT is a good approach.
If it is data-oriented, you might want to consider content control data binding. (There was another approach, called customXml, but the i4i patent farce may make that approach inadvisable if you are relying on Word for editing.)
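For the data-oriented case, here is a rough sketch of why data binding can help: when content controls are bound to a custom XML part, the bound data is stored as its own XML file inside the docx package, so it can be read without touching the document markup. The sketch assumes the parts follow the conventional customXml/itemN.xml naming and are UTF-8 encoded. The class name CustomXmlParts and the method read are made up for illustration, and only JDK classes are used.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: collects the custom XML data parts of a docx package.
class CustomXmlParts {

    // Returns part name -> XML text for every entry named customXml/itemN.xml.
    // Assumption: conventional part names and UTF-8 content; a real package
    // declares its parts through relationships and may use other encodings.
    static Map<String, String> read(File docx) throws IOException {
        Map<String, String> parts = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(docx)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Skip customXml/itemPropsN.xml, which only holds metadata.
                if (name.matches("customXml/item\\d+\\.xml")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        parts.put(name, new String(in.readAllBytes(), StandardCharsets.UTF_8));
                    }
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CustomXmlParts.java <file.docx>");
            return;
        }
        read(new File(args[0])).forEach((name, xml) -> System.out.println(name + ":\n" + xml));
    }
}
```

If the bound part already uses your own schema, its text may be usable as the custom XML directly; otherwise it can still be fed through an XSLT step.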
To the best of my knowledge, docx files are simply XML files in a ZIP container. To convert them to an XML format of your own design, you would need to unzip the file (into a new folder or into memory), load the XML document you want to transform, and apply your XSLT to it. You don't mention anything about your development environment except the "docx4j" tag. Are you developing in Java? If so, I'm afraid I don't know which libraries to point you to for ZIP handling and XML transformation (although I know they exist, and it would only take a five-minute Google search to find them!)
To check out the XML files in a docx, simply change the file's extension from ".docx" to ".zip" and open it in your favorite ZIP archive tool.
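As one possible sketch of the unzip-then-transform steps in Java, the JDK's own java.util.zip and javax.xml.transform packages cover both parts, so no extra library is strictly needed. The assumptions here: the main document part sits at the conventional path word/document.xml, and the stylesheet is a made-up example that emits one para element per Word paragraph. The class name DocxToCustomXml, the method convert, and the target element names document and para are all hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical converter: docx (ZIP) -> word/document.xml -> XSLT -> custom XML.
class DocxToCustomXml {

    // Example stylesheet (an assumption, not a standard mapping): wraps the
    // text of each w:p paragraph in a <para> element under <document>.
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
        + " exclude-result-prefixes=\"w\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<document><xsl:for-each select=\"//w:p\">"
        + "<para><xsl:for-each select=\".//w:t\"><xsl:value-of select=\".\"/></xsl:for-each></para>"
        + "</xsl:for-each></document>"
        + "</xsl:template></xsl:stylesheet>";

    static String convert(File docx, String stylesheet) throws IOException, TransformerException {
        // Compile the stylesheet once; a Templates object can be reused across files.
        Templates xslt = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
        try (ZipFile zip = new ZipFile(docx)) {
            // Assumption: the main part lives at the conventional location.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                StringWriter out = new StringWriter();
                // Stream the part straight from the archive into the transformer.
                xslt.newTransformer().transform(new StreamSource(in), new StreamResult(out));
                return out.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxToCustomXml.java <file.docx>");
            return;
        }
        System.out.println(convert(new File(args[0]), XSLT));
    }
}
```

Reading the entry in memory avoids unpacking to a folder. For a real target schema, the stylesheet would be replaced with one that maps Word styles or headings to your own elements.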
I've had the most luck saving docx as HTML right from Word. The HTML is not XHTML, so you'd need to run Tidy on it. Otherwise, it works fairly well if you must use a Word-based workflow. You can also write a VBA script to have Word open a file and save it as HTML programmatically.