Web service for converting MS Office file formats (doc, docx, ppt, etc.) to plain text?
Larger context: we're working on an Intranet portal's search engine, which needs to be able to search within ALL Office types: doc, docx, xls, xlsx, ppt, and pptx. With the search algorithm already in place, we've implemented the indexer using Office automation; however, the client is concerned that this is (1) error-prone and (2) not recommended by Microsoft (and also not covered by their license).
I've read the previous answers on this topic on SO; however, they would require us to integrate a very large number of distinct libraries to cover all the edge cases, which we don't have the resources to do.
Hence, we're looking for a simple web service to which we can submit any of these documents and which would return simple plain-text output (or HTML, or even PDF; we've got parsers for both).
Are there any such services (free or paid) that cover all of the file formats above?
Many thanks.
Comments (2)
I would suggest trying Apache Tika: it's free and open source. It can extract text content from MS Office file formats (and from other popular formats, too). It includes a server application that you can run on your own server.
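As a rough illustration of what calling such a self-hosted server could look like, here is a minimal Python sketch. It assumes a Tika server is already running locally on its default port 9998 and that its `/tika` endpoint accepts the document via HTTP PUT and returns plain text when asked with `Accept: text/plain`. The helper names `build_tika_request` and `extract_text` and the `TIKA_URL` constant are made up for this example; adjust the URL to your own deployment.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the server for a plain-text rendering."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server up, something like `extract_text("report.docx")` (a hypothetical file name) should return the document's text, which the existing indexer could then consume in place of the Office automation output.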
I'm not sure about a service; however, if you can manage and deploy three .NET assemblies for DOC/DOCX, XLS/XLSX, and PPT/PPTX, then you may try the Aspose components: Aspose.Words, Aspose.Cells, and Aspose.Slides, respectively. These DLLs don't require MS Office to be installed on your server, and they work fine on any Windows OS and in 32-bit/64-bit environments. You may also see the documentation. These components provide many advanced features for dealing with document elements as well. Please see if this might help in your scenario.
Disclosure: I work as a developer evangelist at Aspose.