Creating a PDF from Word (DOC) using Apache POI and iText in Java
I am trying to generate a PDF document from a *.doc document.
So far, thanks to Stack Overflow, I have managed to generate it, but with some problems.
My sample code below generates the PDF without formatting or images, just the text.
The document includes blank spaces and images, which are missing from the PDF.
Here is the code:
in = new FileInputStream(sourceFile.getAbsolutePath());
out = new FileOutputStream(outputFile);
WordExtractor wd = new WordExtractor(in);
String text = wd.getText();
Document pdf= new Document(PageSize.A4);
PdfWriter.getInstance(pdf, out);
pdf.open();
pdf.add(new Paragraph(text));
Comments (5)
docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.
There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.
If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.
Use it like so:
Update July 2016
As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com.
If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.
Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.
The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/
WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.
What you'll need to do is get each paragraph individually, then grab each run, fetch the formatting, and generate the equivalent in PDF.
One option may be to find some code that turns XHTML into a PDF. Then use Apache Tika to turn your Word document into XHTML (it uses POI under the hood and handles all the formatting for you), and go from the XHTML on to PDF.
Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing Word files. It's a really good example of how to get at the images, the formatting, the styles, etc.
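The paragraph-and-run approach described above can be sketched in plain Java. This is only an illustration under assumptions: `StyledRun` is a hypothetical stand-in for whatever your parser hands you (in POI's HWPF that would be a `CharacterRun`, read via `text()`, `isBold()` and `isItalic()`), and the output is XHTML that an XHTML-to-PDF converter of your choice could consume.

```java
import java.util.List;

class RunsToXhtml {
    // Hypothetical stand-in for one formatted run of text; NOT a POI class.
    static final class StyledRun {
        final String text;
        final boolean bold;
        final boolean italic;

        StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    // Escape the characters that are special in XHTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Render one paragraph (a list of runs) as a <p> element,
    // wrapping each run in <i> and/or <b> according to its flags.
    static String paragraph(List<StyledRun> runs) {
        StringBuilder sb = new StringBuilder("<p>");
        for (StyledRun r : runs) {
            String t = escape(r.text);
            if (r.italic) t = "<i>" + t + "</i>";
            if (r.bold) t = "<b>" + t + "</b>";
            sb.append(t);
        }
        return sb.append("</p>").toString();
    }
}
```

A real converter would also have to carry over fonts, sizes, tables and images; the point here is only that formatting has to be read per run, which `WordExtractor.getText()` never exposes.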
I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).
It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.
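The XML-to-XSL-FO step of that pipeline can be sketched with the JDK's built-in XSLT support. This is a minimal sketch under assumptions: the `<doc>`/`<para>` input and the toy stylesheet are made up for illustration and are not WordML or a real WordML stylesheet, the result is only a fragment rather than a complete FO document, and rendering the FO to PDF with FOP is left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XmlToFo {
    // Toy stylesheet (an assumption, not a real WordML stylesheet):
    // wraps each <para> of a made-up <doc> format in an fo:block.
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/doc'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<xsl:apply-templates select='para'/>"
        + "</fo:flow>"
        + "</xsl:template>"
        + "<xsl:template match='para'>"
        + "<fo:block><xsl:value-of select='.'/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Apply an XSLT stylesheet to an XML string and return the result as text.
    static String transform(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

With a real WordML-to-FO stylesheet you would pass file-based `StreamSource` objects instead of strings and hand the resulting FO to FOP.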
Good luck with your project!
Wim
Use OpenOffice/LibreOffice and JODConverter
This also mostly works for .doc to .docx. There are problems with graphics that I have not yet worked out, though.
You need to be running LibreOffice as a server to make this work.
From the command line you can do this using:
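Separately from server mode, LibreOffice also offers a one-shot headless conversion (`soffice --headless --convert-to pdf`) that needs no running server and no JODConverter. A minimal sketch of calling it from Java, assuming `soffice` is on the PATH; the class and method names are hypothetical, and this is a different route from the server setup described above:

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class HeadlessConvert {
    // Build the argument list for a one-shot headless conversion to PDF.
    // Assumes the LibreOffice launcher is reachable as "soffice".
    static List<String> command(File input, File outDir) {
        return Arrays.asList(
                "soffice", "--headless",
                "--convert-to", "pdf",
                "--outdir", outDir.getPath(),
                input.getPath());
    }

    // Run the conversion and return the process exit code.
    static int convert(File input, File outDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(input, outDir))
                .inheritIO() // show LibreOffice's own messages on our console
                .start();
        return p.waitFor();
    }
}
```

Starting a new LibreOffice process per document is slow, which is the main reason to prefer a long-running server when converting many files.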
Another option I came across recently is using the OpenOffice (or LibreOffice) API (see here). I have not had a chance to look into it, but it should be able to open documents in various formats and output them as PDF. If you look into this, let me know how it worked!