Creating a PDF from Word (DOC) in Java using Apache POI and iText

Posted on 2024-11-08 14:56:29

I am trying to generate a PDF document from a *.doc document.
So far, thanks to Stack Overflow, I have managed to generate it, but with some problems.

My sample code below generates the PDF without formatting or images, just the text.
The document includes blank spaces and images which are not included in the PDF.

Here is the code:

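        // Imports assume POI's HWPF module and iText 5 (com.itextpdf.text);
        // older iText releases use the com.lowagie.text package instead.
        import java.io.FileInputStream;
        import java.io.FileOutputStream;
        import org.apache.poi.hwpf.extractor.WordExtractor;
        import com.itextpdf.text.Document;
        import com.itextpdf.text.PageSize;
        import com.itextpdf.text.Paragraph;
        import com.itextpdf.text.pdf.PdfWriter;

        FileInputStream in = new FileInputStream(sourceFile.getAbsolutePath());
        FileOutputStream out = new FileOutputStream(outputFile);

        // WordExtractor returns only the plain text of the .doc file
        WordExtractor wd = new WordExtractor(in);
        String text = wd.getText();

        Document pdf = new Document(PageSize.A4);
        PdfWriter.getInstance(pdf, out);

        pdf.open();
        pdf.add(new Paragraph(text));
        pdf.close(); // finishes writing the PDF; without it the file is incomplete
        in.close();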

狼亦尘 2024-11-15 14:56:29

docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.

There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.

If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.

Use it like so:

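        import java.io.OutputStream;
        import org.docx4j.fonts.IdentityPlusMapper;
        import org.docx4j.fonts.Mapper;
        import org.docx4j.fonts.PhysicalFont;
        import org.docx4j.fonts.PhysicalFonts;
        import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

        WordprocessingMLPackage wordMLPackage
            = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

        // Set up font mapper
        Mapper fontMapper = new IdentityPlusMapper();
        wordMLPackage.setFontMapper(fontMapper);

        // Example of mapping missing font Algerian to installed font Comic Sans MS
        PhysicalFont font
            = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
        fontMapper.getFontMappings().put("Algerian", font);

        // Choose the conversion route: XSL-FO (recommended) or the older iText one
        org.docx4j.convert.out.pdf.PdfConversion c
            = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        //  = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);

        // Write the PDF next to the input file
        OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf");
        c.output(os);
        os.close();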

Update July 2016

As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com

If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.

Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.

The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/

紙鸢 2024-11-15 14:56:29

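WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.

What you'll need to do is get each paragraph individually, then grab each run, fetch its formatting, and generate the equivalent in the PDF.

One option may be to find some code that turns XHTML into a PDF. Then use Apache Tika to turn your Word document into XHTML (it uses POI under the hood and handles all the formatting for you), and go from the XHTML on to PDF.

Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing Word files. It's a good example of how to get at the images, the formatting, the styles, and so on.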

勿忘初心 2024-11-15 14:56:29

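I have successfully used Apache FOP to convert "WordML" documents to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web that transform this XML into XSL-FO, which in turn can be rendered by FOP to PDF (among other outputs).

It's not that different from the solution plutext offered, except that it doesn't read .doc documents, whereas docx4j apparently does. If your requirements are flexible enough to accept WordML-style documents as input, this might be worth looking into.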
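
For what it's worth, the XSLT step of that pipeline needs nothing beyond the JDK's `javax.xml.transform` API. Below is a minimal sketch of it. The class name `WordMlToFoSketch`, the `<doc>`/`<p>` input and the toy stylesheet are all made up for illustration: a real WordML-to-FO stylesheet is far larger, and the output here is not a complete XSL-FO document (no page masters), so FOP would not render it as is.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class WordMlToFoSketch {

    // Toy stylesheet (hypothetical) standing in for a real WordML-to-FO one:
    // it wraps the text of every <p> element in an fo:block.
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/doc'>"
      + "<fo:root><xsl:apply-templates select='p'/></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<fo:block><xsl:value-of select='.'/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given XSLT stylesheet to the given XML and returns the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up input; a real run would read the WordML file instead.
        System.out.println(transform("<doc><p>Hello</p></doc>", TOY_XSLT));
    }
}
```

In a real setup you would swap in the WordML file and the downloaded stylesheet as the two sources, then hand the resulting XSL-FO to FOP for rendering.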

Good luck with your project!
Wim

却一份温柔 2024-11-15 14:56:29

Use OpenOffice/LibreOffice and JODConverter.
This also mostly works for .doc to .docx. There are problems with graphics that I have not yet worked out, though.

    private static OfficeManager officeManager;

    private static void transformDocXToPDFUsingJOD(File in, File out) {
        OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
        DocumentFormat pdf = converter.getFormatRegistry().getFormatByExtension("pdf");
        converter.convert(in, out, pdf);
    }

    @BeforeClass
    public static void setupStatic() throws IOException {
        /*officeManager = new DefaultOfficeManagerConfiguration()
          .setOfficeHome("C:/Program Files/LibreOffice 3.6")
          .buildOfficeManager();
          */
        // Connect to an already running LibreOffice instance listening on port 8100
        officeManager = new ExternalOfficeManagerConfiguration()
            .setConnectOnStart(true)
            .setPortNumber(8100)
            .buildOfficeManager();

        officeManager.start();
    }

    @AfterClass
    public static void shutdownStatic() throws IOException {
        officeManager.stop();
    }

You need to be running LibreOffice as a server to make this work.
From the command line you can do this using:

"C:\Program Files\LibreOffice 3.6\program\soffice.exe" -accept="socket,host=0.0.0.0,port=8100;urp;LibreOffice.ServiceManager" -headless -nodefault -nofirststartwizard -nolockcheck -nologo -norestore

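
If you would rather start that listener from Java than from a shell, something along these lines should work with plain `ProcessBuilder`. This is only a sketch: the class and method names (`SofficeLauncherSketch`, `buildCommand`, `start`) are made up, and the flags are copied from the command line above rather than checked against every LibreOffice version. Since no shell is involved, the quotes around the `-accept` value are not needed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class SofficeLauncherSketch {

    // Builds the argument list for a headless LibreOffice instance that
    // accepts UNO connections on the given port (flags taken from the command above).
    static List<String> buildCommand(String sofficePath, int port) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sofficePath);
        cmd.add("-accept=socket,host=0.0.0.0,port=" + port
                + ";urp;LibreOffice.ServiceManager");
        cmd.add("-headless");
        cmd.add("-nodefault");
        cmd.add("-nofirststartwizard");
        cmd.add("-nolockcheck");
        cmd.add("-nologo");
        cmd.add("-norestore");
        return cmd;
    }

    // Starts the process; the caller is responsible for destroying it when done.
    static Process start(String sofficePath, int port) throws IOException {
        return new ProcessBuilder(buildCommand(sofficePath, port))
            .inheritIO()
            .start();
    }

    public static void main(String[] args) {
        // Only prints the command here; call start(...) to actually launch LibreOffice.
        String path = "C:\\Program Files\\LibreOffice 3.6\\program\\soffice.exe";
        System.out.println(String.join(" ", buildCommand(path, 8100)));
    }
}
```

Note that `host=0.0.0.0` makes the listener reachable on all network interfaces; binding to `127.0.0.1` instead keeps it local to the machine.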