We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
The new Office 2007 formats are just ZIP-compressed XML.
For .docx at least, the body text is in word/document.xml once you unzip the file (headers, footers and footnotes live in separate parts). Strip out all the XML tags and you get the text. You lose the formatting, no doubt, but if you want to do text indexing or something like it, formatting isn't relevant anyway. The order of the text is preserved.
I haven't analyzed Excel and PowerPoint, but the approach should be similar. Excel might be trickier, depending on how the cells are stored in the XML file.
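A minimal sketch of the .docx approach in Python, using only the standard library. The function name `docx_to_text` is made up for illustration. Rather than stripping tags with a regex, it walks the `w:p` (paragraph) and `w:t` (text run) elements, so each paragraph lands on its own line. It assumes a simple document: text boxes, headers and footnotes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    """
    # A .docx is a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # iter() walks in document order, so the text order is preserved.
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (with `report.docx` standing in for your own file) returns the text as a single string, ready to hand to an indexer.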
The Apache POI library can extract text from Office formats. Apache Tika, which grew out of the Lucene project, uses it for this. Tika can also be run as a command-line tool.
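A rough sketch of driving the Tika command line from Python. The jar path you pass in is a placeholder for wherever you saved the Tika application jar, the helper names are made up for illustration, and a `java` executable is assumed to be on the PATH. The `--text` option asks Tika for plain-text output and `--encoding=UTF-8` sets the output encoding.

```python
import subprocess


def tika_text_command(jar_path, document_path):
    """Build the argument list for a plain-text Tika extraction."""
    return [
        "java", "-jar", str(jar_path),
        "--text",            # plain text instead of the default XHTML
        "--encoding=UTF-8",  # make the output encoding predictable
        str(document_path),
    ]


def extract_with_tika(jar_path, document_path):
    """Run Tika and return the extracted text (requires Java and the jar)."""
    result = subprocess.run(
        tika_text_command(jar_path, document_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Starting a JVM per file is slow, so for bulk indexing it is probably better to call POI or Tika directly from Java.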
PyODConverter automates OpenOffice; use it to do the conversions.
The OONinja example converts Doc to PDF, but any import or export format that OpenOffice supports should work. It also has the advantage of running headless if required.
Other options include:
Abiword
Or, if you really just want a command-line tool, wvWare, though I don't think it supports Docx.
With the appropriate licence, you can use Autonomy Keyview in your application. It seems to be extremely powerful and can extract text from almost everything; we use it to identify text within arbitrary format files.
I've no idea what the licensing terms are, but they're available from your account manager :)