Native Java document parser and converter library / Linux-based document converter
I'm looking for a Java library which can do the following:
Parse emails in *.eml or *.msg format for attachments of type DOC, DOCX, JPEG, PNG, GIF, TXT, XLS, XLSX, PPT and PDF, and convert the attachments to TIFF.
It can be either an open source or a commercial library. Alternatively, I'm looking for Linux command line tools that do this. We have already tried OpenOffice, but there are too many problems with some document formats.
UPDATE:
What I have found out through research so far:
For parsing emails and extracting attachments, JavaMail (http://www.oracle.com/technetwork/java/javamail/index.html) is a good choice.
For converting documents, JodConverter (http://code.google.com/p/jodconverter/) is a convenient library. However, it is only a wrapper around OpenOffice, so if OpenOffice has trouble converting a document (and I often do have trouble with OpenOffice), you will hit the same problems with JodConverter.
In conclusion, I have had no luck so far finding a document conversion library implemented in native Java that handles all common document formats, whether open source or commercial. It seems to be a real market gap.
Comments (4)
RainbowPDF may fit: it's a commercial, server-based conversion tool with a Java API.
If you've got a Windows server, have a look at NEEVIA Document Converter Pro. It has some mail functionality.
Apache POI is an interface for reading the content of Microsoft Office documents. You will have to code the image generation and layout components on your own. Nevertheless, it does read the Outlook MSG format.
Apache POI - the Java API for Microsoft Documents. However, I don't know of an easy way to convert the parsed document to TIFF.
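For the image attachments at least (JPEG, PNG, GIF), the TIFF step may not need a third-party library. The following is a minimal sketch, assuming Java 9 or later, where the JDK's `javax.imageio` ships a TIFF writer. The class name `ImageToTiff` and the file names are made up for illustration. It does nothing for the Office or PDF attachments, which still need a renderer.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: re-encodes a raster image (JPEG, PNG, GIF) as TIFF.
// Assumption: Java 9+, where ImageIO includes a TIFF writer plugin.
final class ImageToTiff {

    static void convert(File source, File target) throws IOException {
        // ImageIO.read returns null when no registered reader understands the file.
        BufferedImage image = ImageIO.read(source);
        if (image == null) {
            throw new IOException("Unsupported or unreadable image: " + source);
        }
        // ImageIO.write returns false when no writer exists for the requested format.
        if (!ImageIO.write(image, "tiff", target)) {
            throw new IOException("No TIFF writer available in this Java runtime");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ImageToTiff <input image> <output.tiff>");
            return;
        }
        convert(new File(args[0]), new File(args[1]));
    }
}
```

Multi-page output, compression settings and resolution metadata are left out here; they would need an `ImageWriter` with explicit write parameters.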
Maybe a mix of different approaches could be useful? Depending on your requirements, it may be possible to use several libraries to cover all the formats you need to handle: Microsoft Office, Adobe PDF, a few image formats and plain text files.
I mean, you could create a process that, for each file extracted with JavaMail, recognizes the file's format and continues with the right conversion mechanism using a suitable library. Once you have identified the type: if the file is an image, try Java Advanced Imaging; if it's a Microsoft Office file, try Apache POI; and so on. For PDF files, you can try Apache PDFBox, which is another good open source solution.
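The routing step described above could be sketched roughly like this. It is only an illustration: the class `AttachmentRouter`, the `Kind` enum and the choice to classify by file extension are my own assumptions (checking the MIME type reported by the mail parser would be more robust), and the method returns the name of the library suggested in this answer rather than calling it.

```java
import java.util.Locale;

// Hypothetical router: maps an attachment file name to a format family
// and to the library this answer suggests for that family.
final class AttachmentRouter {

    enum Kind { OFFICE, PDF, IMAGE, TEXT, UNKNOWN }

    // Classifies by extension only; the content itself is not inspected.
    static Kind classify(String fileName) {
        if (fileName == null) {
            return Kind.UNKNOWN;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Kind.UNKNOWN;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc": case "docx": case "xls": case "xlsx": case "ppt":
                return Kind.OFFICE;
            case "pdf":
                return Kind.PDF;
            case "jpeg": case "jpg": case "png": case "gif":
                return Kind.IMAGE;
            case "txt":
                return Kind.TEXT;
            default:
                return Kind.UNKNOWN;
        }
    }

    // Names the library suggested above; no conversion is performed here.
    static String suggestedLibrary(Kind kind) {
        switch (kind) {
            case OFFICE: return "Apache POI";
            case PDF:    return "Apache PDFBox";
            case IMAGE:  return "Java Advanced Imaging";
            case TEXT:   return "no library named in this answer";
            default:     return "none";
        }
    }

    public static void main(String[] args) {
        String[] names = {"report.DOCX", "scan.png", "invoice.pdf", "notes.txt", "archive.zip"};
        for (String name : names) {
            Kind kind = classify(name);
            System.out.println(name + " -> " + kind + " -> " + suggestedLibrary(kind));
        }
    }
}
```

Each branch would then hand the file to its own converter, so a problem with one format family does not block the others.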
By the way, if you are not looking only for a Java approach, this thread may help you.
I don't know whether there are better commercial solutions than the ones @ChrisGer mentioned.
Do not waste your time looking at Apache POI: it can only parse the content of Office files and is not suitable for rendering them.
Since OpenOffice can be run as a server, I suggest going that route. I also know you can easily use DCOM to talk to the Microsoft Office applications, so maybe a Java->DCOM bridge is more up to the task. However, this is not even recommended by Microsoft (so I suppose the JodConverter approach is equally unstable).