
Programmatically convert a Word document to HTML in Java

Posted 2024-07-07 08:13:16

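I need to convert a Word document to HTML files in Java. The function takes a Word document as input and produces one HTML file per page of the document; that is, if the Word document has 3 pages, it should generate 3 HTML files, split at the corresponding page breaks.

I searched for open-source/non-commercial APIs that can convert doc to HTML, but found nothing. Could anyone who has done this kind of work before please help?

Thanks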


俯瞰星空 2024-07-14 08:13:17

This is easier with the newer MS Word docx format, because the format is XML.
You can use XSL to transform the XML-based Word document into HTML.
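As a rough illustration of the XSL route, here is a minimal sketch using only the JDK (Java 9+ for `readAllBytes`). It relies on the fact that a .docx file is a ZIP package whose main body lives in `word/document.xml`. The class name `DocxXslSketch` and the tiny stylesheet are assumptions made up for this example: the stylesheet only emits one `<p>` per top-level paragraph with its plain text, and ignores styles, tables, images, and page breaks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DocxXslSketch {

    // Hypothetical minimal stylesheet: one <p> per top-level Word paragraph, text only.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
      + " exclude-result-prefixes=\"w\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\" indent=\"no\"/>"
      + "<xsl:template match=\"/\"><html><body>"
      + "<xsl:apply-templates select=\"//w:body/w:p\"/>"
      + "</body></html></xsl:template>"
      + "<xsl:template match=\"w:p\"><p>"
      + "<xsl:for-each select=\".//w:t\"><xsl:value-of select=\".\"/></xsl:for-each>"
      + "</p></xsl:template>"
      + "</xsl:stylesheet>";

    // Reads the main document part out of the .docx ZIP package.
    static String readDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Applies the stylesheet to the WordprocessingML text and returns the result markup.
    static String toHtml(String documentXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(documentXml)),
                              new StreamResult(out));
        return out.toString();
    }
}
```

Calling `DocxXslSketch.toHtml(DocxXslSketch.readDocumentXml(Path.of("document.docx")))` would return the markup as a string. A stylesheet that handles real documents needs many more templates than this one.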

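However, if your Word document is in the older format, you can use the POI library (http://poi.apache.org/) to read it into Java objects. From there you can convert it to HTML fairly easily with a Java HTML library, for example:

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html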


咆哮 2024-07-14 08:13:17

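I see this thread showing up in external links and still getting the occasional post, so I thought I'd post an update (hope nobody minds). OpenOffice keeps evolving, and version 3.2 has again improved the Word import/export filters. OpenOffice and Java both run on many platforms, so a Java system can use the OpenOffice UNO API directly to import, manipulate, and export documents in many formats (including Word and PDF), or use a library such as JODReports or Docmosis for convenience. Both have free/open options.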
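This is not the UNO API mentioned above. It is a simpler variant of the same idea, which is to let the office suite do the conversion: a sketch that shells out to the office command line from Java. It assumes a LibreOffice installation (the OpenOffice fork) with `soffice` on the PATH and its `--headless --convert-to` options; whether a given OpenOffice release accepts the same flags is not confirmed by this thread. The class name `SofficeConvertSketch` is made up for the example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class SofficeConvertSketch {

    // Builds the command line. Assumption: "soffice" (LibreOffice) is on the PATH.
    static List<String> command(Path input, Path outDir) {
        return List.of("soffice", "--headless", "--convert-to", "html",
                       "--outdir", outDir.toString(), input.toString());
    }

    // Runs the conversion and returns the process exit code (0 normally means success).
    static int convert(Path input, Path outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(input, outDir))
            .inheritIO()   // forward the converter's messages to this process's console
            .start();
        return process.waitFor();
    }
}
```

For example, `SofficeConvertSketch.convert(Path.of("test.doc"), Path.of("out"))` would be expected to leave the HTML file in the `out` directory. Note that this yields a single HTML file, not one file per page.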


西瑶 2024-07-14 08:13:17

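I tried the approach from this site and it worked for me:
http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

It only works with docx. It converts the document to HTML, including the images embedded in the Word document.

    // Imports as used by older xdocreport releases; newer releases may use a different package.
    import java.io.*;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.converter.core.FileImageExtractor;
    import org.apache.poi.xwpf.converter.core.FileURIResolver;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;

    // 1) Load DOCX into XWPFDocument
    File docxFile = new File("c:/document.docx");
    try (InputStream in = new FileInputStream(docxFile);
         OutputStream out = new FileOutputStream(new File("c:/document.html"))) {
        XWPFDocument document = new XWPFDocument(in);

        // 2) Prepare XHTML options
        XHTMLOptions options = XHTMLOptions.create();

        // 3) Extract images into a folder named after the source file
        File imageFolder = new File("target/images/" + docxFile.getName());
        options.setExtractor(new FileImageExtractor(imageFolder));

        // 4) URI resolver so the generated HTML points at the extracted images
        options.URIResolver(new FileURIResolver(imageFolder));

        // 5) Convert and write the HTML; both streams are closed automatically
        XHTMLConverter.getInstance().convert(document, out, options);
    }

I hope this solves your issue.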


风月客 2024-07-14 08:13:17

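You would have to find the MS Word document specification (the format is basically a binary dump of whatever was in Word at the time), then slowly go through it element by element, converting each MS Word "object/state" to its HTML equivalent. You may be able to find a script that does this for you, because this is really not a fun job, and I'd advise against doing it yourself (converting file formats, or even just reading proprietary ones yourself, is always hard and usually incomplete).
PS: Just google doc2html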


朕就是辣么酷 2024-07-14 08:13:17

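If you are targeting Word 2007 files in the OOXML format, this article may help.
There is also the Ooxml4j project, which is implementing OOXML as a Java library.

If you are targeting binary files... that's another problem.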


老子叫无熙 2024-07-14 08:13:17

import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
...
FileInputStream fis = new FileInputStream(new File("test.doc"));
FileOutputStream fos = new FileOutputStream(new File("test.html"));
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"html");

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf
