Parsing and generating Microsoft Office 2007 files (.docx, .xlsx, .pptx)
I have a web project where I must import text and images from a user-supplied document, and one of the possible formats is Microsoft Office 2007. There's also a need to generate documents in this format.
The server runs CentOS 5.2 and has PHP/Perl/Python installed. I can execute local binaries and shell scripts if I must. We use Apache 2.2 but will be switching over to Nginx once it goes live.
What are my options? Has anyone had experience with this?
Comments (4)
The Office 2007 file formats are open and well documented. Roughly speaking, all of the new file formats ending in "x" are zip-compressed XML documents. For example, a .docx file is a zip archive whose main body text is stored in word/document.xml.
The other file formats are roughly similar. I don't know of any open source libraries for interacting with them as yet - but depending on your exact requirements, it doesn't look too difficult to read and write simple documents. Certainly it should be a lot easier than with the older formats.
If you need to read the older formats, OpenOffice has an API and can read and write Office 2003 and older documents with more or less success.
The python docx module can generate formatted Microsoft Office docx files from pure Python. Out of the box, it does headers, paragraphs, tables, and bullets, but the makeelement() function can be extended to do arbitrary elements like images.
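For a sense of what generating a .docx involves underneath, here is a rough standard-library sketch. It is not the docx module's API: `build_docx` and the constants are hypothetical names, and the three parts written here are what I understand to be the minimum for a WordprocessingML package. I have not checked the result against Word itself, so treat it as a starting point.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    "</Types>"
)

# Points the package at its main document part.
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    "</Relationships>"
)


def build_docx(paragraphs, target):
    """Write a bare-bones .docx with one plain-text run per paragraph.

    `target` may be a file path or a writable binary file-like object.
    """
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document)


if __name__ == "__main__":
    build_docx(["Report title", "First paragraph of body text."], "example.docx")
```

Headings, tables, bullets and images all need extra XML (and, for images, extra parts and relationships), which is the bookkeeping a library takes off your hands.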
I have successfully used the OpenXML Format SDK in a project to modify an Excel spreadsheet via code. This would require .NET and I'm not sure about how well it would work under Mono.
You can probably check the code for Sphider. It indexes docs and PDFs, so I'm sure it can read them. It might also point you in the right direction for other Office formats.