Programmatically convert a Word document to HTML in Java

Posted on 2024-07-07 08:13:16

I need to convert a Word document into one or more HTML files in Java. The function takes a Word document as input and produces one HTML file per page: if the document has 3 pages, 3 HTML files are generated, split at the page breaks.

I searched for open-source or non-commercial APIs that can convert doc to HTML, but found nothing. If anybody has done this kind of work before, please help.

Thanks

Comments (11)

俯瞰星空 2024-07-14 08:13:17

This is easier with the newer MS Word docx format, because the format is XML-based.
You can use an XSL stylesheet to transform the Word document's XML into HTML.

If, however, your Word document is in the older binary format, you can use the POI library
http://poi.apache.org/ to read it into Java objects, and from there write it out as HTML using a Java HTML library, for example:

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html
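
As a rough illustration of the XSL route: a .docx file is a ZIP archive whose main body lives in `word/document.xml`, and the JDK's built-in `javax.xml.transform` can apply a stylesheet to it. The stylesheet below is a deliberately tiny, hypothetical one that only maps paragraphs and text runs, and the class and method names are made up for this sketch. A real conversion would need many more rules for styles, tables, lists and images.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class DocxXsltSketch {
    // Hypothetical minimal stylesheet: each w:p becomes <p>, each w:t contributes its text.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'><html><body>"
      + "<xsl:apply-templates select='//w:body/w:p'/>"
      + "</body></html></xsl:template>"
      + "<xsl:template match='w:p'><p><xsl:apply-templates select='.//w:t'/></p></xsl:template>"
      + "<xsl:template match='w:t'><xsl:value-of select='.'/></xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML of word/document.xml and returns the HTML text. */
    static String toHtml(Source documentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(documentXml, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    /** Opens the .docx as a ZIP archive and transforms its main document part. */
    static String docxToHtml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return toHtml(new StreamSource(in));
            }
        }
    }
}
```

Calling `DocxXsltSketch.docxToHtml("c:/document.docx")` would return the whole document as a single HTML string; formatting is dropped entirely by this toy stylesheet.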

咆哮 2024-07-14 08:13:17

I see this thread turns up in external links and still gets the occasional post, so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve, and release 3.2 improves the Word import/export filters again. OpenOffice and Java both run on many platforms, so Java systems can use the OpenOffice UNO API directly to import, manipulate, and export documents in many formats (including Word and PDF), or use a library such as JODReports or Docmosis to make that easier. Both have free/open options.

西瑶 2024-07-14 08:13:17

I tried the approach described on this site and it worked for me:
http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

It only works with docx. It converts the document to HTML, including the images embedded in the Word document.

    // Imports: java.io.* and org.apache.poi.xwpf.usermodel.XWPFDocument, plus
    // XHTMLConverter, XHTMLOptions, FileImageExtractor and FileURIResolver from
    // xdocreport (their package names differ between xdocreport versions).

    // 1) Load DOCX into XWPFDocument
    InputStream doc = new FileInputStream(new File("c:/document.docx"));
    XWPFDocument document = new XWPFDocument(doc);

    // 2) Prepare XHTML options
    XHTMLOptions options = XHTMLOptions.create();

    // 3) Extract images into a folder on disk
    //    (the original concatenated the InputStream itself into this path)
    File imageFolder = new File("target/images/document");
    options.setExtractor(new FileImageExtractor(imageFolder));

    // 4) URI resolver, so the generated HTML points at the extracted images
    options.URIResolver(new FileURIResolver(imageFolder));

    // 5) Convert and write the HTML, then release both streams
    OutputStream out = new FileOutputStream(new File("c:/document.html"));
    XHTMLConverter.getInstance().convert(document, out, options);
    out.close();
    doc.close();

I hope this solves your issue.

风月客 2024-07-14 08:13:17

You'd have to find the MS Word doc specification (the format is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script that does it for you, since this really isn't fun work and I'd advise against doing it yourself (converting file formats, or even reading commercial file formats on your own, is always hard and often incomplete).
PS: just google doc2html

朕就是辣么酷 2024-07-14 08:13:17

If you are targeting Word 2007 files, which use the OOXML format, then this article might help.
There is also the Ooxml4j project, which is implementing OOXML support as a Java library.

If you are targeting the binary files, though, that's another problem.

老子叫无熙 2024-07-14 08:13:17
import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
...
FileInputStream fis = new FileInputStream(new File("test.doc"));
FileOutputStream fos = new FileOutputStream(new File("test.html"));
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"html");

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf

剪不断理还乱 2024-07-14 08:13:17

You can use Microsoft Office Online.

First, on the server side, request https://view.officeapps.live.com/op/view.aspx?src='your doc file online url'

Then use jsoup to parse the resulting HTML.

When accessed from a mobile device, the HTML is wrapped in a frame.
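
The request step might look roughly like the sketch below, which uses only the JDK's `java.net.http` client (Java 11+). The class and method names are invented for illustration, and URL-encoding the `src` parameter is an assumption about what the viewer expects; whether the response contains usable document HTML for a server-side client is not something this sketch can promise. Parsing the result is left to jsoup, as the answer suggests.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class OfficeViewerSketch {
    static final String VIEWER = "https://view.officeapps.live.com/op/view.aspx?src=";

    /** Builds the viewer URL for a document that is already reachable online. */
    static String viewerUrl(String documentUrl) {
        // Assumption: the document URL is passed URL-encoded in the src parameter.
        return VIEWER + URLEncoder.encode(documentUrl, StandardCharsets.UTF_8);
    }

    /** Fetches the viewer page and returns the raw HTML as a string. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

A caller would combine the two, for example `fetch(viewerUrl(docUrl))`, and hand the returned string to the HTML parser.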

氛圍 2024-07-14 08:13:16

I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has a lot of documentation, scripts, and tutorials to help you out.

自控 2024-07-14 08:13:16

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start OpenOffice once when your program starts, and call the Python script as many times as you need during the run (with some sort of check to ensure the ooffice process is always there).

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
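
Spawning that command from Java could be sketched as follows. The class name, method names and timeout handling are illustrative choices, not part of the original setup, and the macro URL is taken as a parameter because it depends entirely on your own macro. `ProcessBuilder` passes each argument directly to the process, so the shell quoting shown in the command above is not needed.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class OfficeMacroRunner {
    /** Builds the argument list; macroUrl is whatever "macro://..." URL your installation uses. */
    static List<String> command(String officeBinary, String macroUrl) {
        return List.of(officeBinary, "-headless", macroUrl);
    }

    /** Runs the conversion and returns the exit code; kills the process if it exceeds the timeout. */
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
            .redirectErrorStream(true)                        // merge stderr into stdout
            .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // avoid blocking on unread output
            .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("conversion timed out: " + command);
        }
        return process.exitValue();
    }
}
```

A timeout is worth having here because a headless office process that hangs on a malformed document would otherwise block the calling thread indefinitely.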

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.

丑疤怪 2024-07-14 08:13:16

We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

裸钻 2024-07-14 08:13:16

If it's a docx, you could use docx4j (ASL v2). This uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
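
One way to "do something" with that marker, using only the JDK's DOM parser, is sketched below. It is a simplification under stated assumptions: it works on the XML text of `word/document.xml`, treats any paragraph containing a `w:lastRenderedPageBreak` element as the start of a new page (in reality the break can fall in the middle of a paragraph), and emits plain `<p>` elements without any formatting. The class and method names are hypothetical and this is not docx4j's API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PageSplitSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one HTML fragment per rendered page, judged by w:lastRenderedPageBreak. */
    static List<String> splitPages(String documentXml) {
        Document dom;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            dom = factory.newDocumentBuilder().parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not parseable XML", e);
        }
        List<String> pages = new ArrayList<>();
        StringBuilder page = new StringBuilder();
        NodeList paragraphs = dom.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            boolean startsNewPage =
                p.getElementsByTagNameNS(W, "lastRenderedPageBreak").getLength() > 0;
            if (startsNewPage && page.length() > 0) {
                pages.add(page.toString());   // close the page collected so far
                page.setLength(0);
            }
            page.append("<p>").append(escape(textOf(p))).append("</p>");
        }
        if (page.length() > 0) {
            pages.add(page.toString());
        }
        return pages;
    }

    /** Concatenates the text of all w:t runs inside a paragraph. */
    static String textOf(Element paragraph) {
        StringBuilder text = new StringBuilder();
        NodeList runs = paragraph.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            text.append(runs.item(i).getTextContent());
        }
        return text.toString();
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Each string in the returned list could then be wrapped in an HTML skeleton and written to its own file. Because the marker records where Word last paginated the document, the split is only as accurate as the last time the file was saved from Word.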
