How can I prevent my PDF-to-SVG conversion code from generating bloated output?

Posted on 2024-09-30 16:59:42

I want to convert PDF to SVG. I have written my own Java program using the Apache PDFBox and Batik libraries:

import java.awt.print.Printable;
import java.io.File;
import org.apache.batik.dom.GenericDOMImplementation;
import org.apache.batik.svggen.SVGGeneratorContext;
import org.apache.batik.svggen.SVGGraphics2D;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;

PDDocument document = PDDocument.load(pdfFile);
DOMImplementation domImpl = GenericDOMImplementation.getDOMImplementation();

// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);

// Render each page into its own SVG file.
for (int i = 0; i < document.getNumberOfPages(); i++) {
    String svgFName = svgDir + "page" + i + ".svg";
    new File(svgFName).createNewFile();
    // Create an instance of the SVG generator for this page.
    SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx, false);
    Printable page = document.getPrintable(i);
    page.print(svgGenerator, document.getPageFormat(i), i);
    svgGenerator.stream(svgFName);
}
document.close();

This solution works, but the size of the resulting SVG files is huge (many times greater than the originating PDF). I have figured out where the problem is by looking at the SVG in a text editor: it encloses every character in the original document in its own <text> </text> block even if the font properties of the characters are the same.

For example, the word "hello" will appear as 6 different text blocks.

Is there a way to fix the above code? Or is there another solution that will work more efficiently?
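
To put a number on the bloat instead of eyeballing it in a text editor, one could count the <text> elements in a generated page. The following is only a diagnostic sketch, not a fix: SvgTextCounter and countTextElements are hypothetical names, and it uses the JDK's built-in StAX parser, not PDFBox or Batik.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of PDFBox or Batik.
public class SvgTextCounter {

    /** Counts the <text> start tags in an SVG document read from the given reader. */
    public static int countTextElements(Reader svg) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser not to process DTDs.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(svg);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "text".equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse SVG", e);
        }
    }

    public static void main(String[] args) {
        // Tiny in-memory sample with two <text> elements.
        String sample = "<svg xmlns=\"http://www.w3.org/2000/svg\">"
                + "<text>h</text><text>i</text></svg>";
        System.out.println(countTextElements(new StringReader(sample)));
    }
}
```

For a real page, pass a java.io.FileReader for one of the generated files. Comparing the count with the number of characters on the page shows whether each character really gets its own block.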

Comments (6)

风居住的梦幻卍 2024-10-07 16:59:43

Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but Inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.

To use Inkscape's command-line interface to convert a PDF to an SVG, use:

inkscape -l out.svg in.pdf

Which you can then probably call using:

Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29

Note that exec() does not wait for the process to complete: it returns a Process object as soon as the process has been started, so call waitFor() on that object before you read "out.svg". In any case, Googling "java system call" will yield more info on how to do that part correctly.
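
A minimal sketch of that system call, assuming inkscape is on the PATH. It uses ProcessBuilder and waits with Process.waitFor(), so out.svg is only read once Inkscape has exited. InkscapeRunner, inkscapeCommand and runAndWait are hypothetical names for illustration. The export flag is a parameter because another answer in this thread reports that -l stopped working in newer Inkscape releases and that -o worked on 1.1.2.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper class; the names here are illustrative only.
public class InkscapeRunner {

    /** Builds the argument list for: inkscape <exportFlag> <svgFile> <pdfFile>. */
    public static List<String> inkscapeCommand(String exportFlag, String svgFile, String pdfFile) {
        return List.of("inkscape", exportFlag, svgFile, pdfFile);
    }

    /** Starts the command, waits for it to finish, and returns its exit code. */
    public static int runAndWait(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the tool's output so its pipe buffers cannot fill up
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: InkscapeRunner <in.pdf> <out.svg>");
            return;
        }
        // "-l" is the flag used in the answer above; pass "-o" for newer releases.
        int exitCode = runAndWait(inkscapeCommand("-l", args[1], args[0]));
        if (exitCode != 0) {
            System.err.println("inkscape exited with code " + exitCode);
        }
    }
}
```

Passing the arguments as a list avoids the whitespace splitting that the single-string form of exec() performs, which matters as soon as a file name contains a space.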

完美的未来在梦里 2024-10-07 16:59:43

Take a look at pdf2svg (also on GitHub).

Usage:

pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]

When using all, give an output filename with %d in it (which will be replaced by the page number):

pdf2svg input.pdf output_page%d.svg all

For some troubleshooting, see:
http://www.calcmaster.net/personal_projects/pdf2svg/

动次打次papapa 2024-10-07 16:59:43

pdftocairo can be used to convert PDF to SVG.
It's part of poppler-utils, which can be installed from PyPI via pip, built from git, or installed via your OS package manager (e.g. Ubuntu/Debian has it under this same name).

For example, to convert the first page of a PDF (-f and -l give the first and last page to convert, numbered from 1), the following command can be run:

pdftocairo -svg -f 1 -l 1 input.pdf

尘世孤行 2024-10-07 16:59:43

I have encountered issues with the suggested inkscape, pdf2svg, or pdftocairo tools, as well as the not-suggested convert and mutool tools, when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files.

The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:

dvisvgm --pdf --output=file.svg file.pdf

It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.

岛徒 2024-10-07 16:59:43

You can use bash in a *nix environment.

The burst operation splits the PDF into one file per page. to-svg.sh then loops through these single-page PDFs and generates the associated SVG file for each:

mkdir -p burst
pdftk 82page.pdf burst output burst/pg_%04d.pdf
bash to-svg.sh

Contents of to-svg.sh:

#!/bin/bash
# Convert every single-page PDF in burst/ to an SVG next to it.
for f in burst/*.pdf
do
  inkscape -l "${f%.pdf}.svg" "$f"
done

月亮坠入山谷 2024-10-07 16:59:43

Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form of that option is listed in the man page as --export-plain-svg; it works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:

inkscape in.pdf -o out.svg