Rendering the whole media box of a PDF page to a PNG file using Ghostscript

Posted on 2024-11-17 01:40:33 · 783 words · 7 views · 0 comments

I'm trying to render PDF pages to PNG files using Ghostscript v9.02. For that purpose I'm using the following command line:

gswin32c.exe -sDEVICE=png16m -o outputFile%d.png mypdf.pdf

This is working fine when the pdf crop box is the same as the media box, but if the crop box is smaller than the media box, only the media box is displayed and the border of the pdf page is lost.
I know usually pdf viewers only display the crop box but I need to be able to see the whole media page in my png file.

The Ghostscript documentation says that by default the media box of a document is rendered, but this does not work in my case.
Does anyone have an idea how I could render the whole media box using Ghostscript?
Could it be that for the PNG file devices only the crop box is rendered? Am I maybe forgetting a specific option?

For example, this pdf contains some registration marks outside of the crop box, which are not present in the output png file. Some more information about this pdf:

  • media box:
    • width: 667 pts
    • height: 908 pts
  • crop box:
    • width: 640 pts
    • height: 851 pts


Comments (2)

不奢求什么 2024-11-24 01:40:33

OK, now that revers has restated his problem and says he is looking for "generic code", let me try again.

The problem with "generic code" is that a "CropBox" statement can legally be written in many different forms inside a PDF. All of the following are possible and correct, and set the same values for the page's CropBox:

  • /CropBox[10 20 500 700]

  • /CropBox[ 10 20 500 700 ]

  • /CropBox[10 20 500 700 ]

  • /CropBox [10 20 500 700]

  • /CropBox [ 10 20 500 700 ]

  • /CropBox [ 10.00 20.0000 500.0 700 ]

  • /CropBox [    
              10    
              20    
              500    
              700    
             ] 

The same is true for ArtBox, TrimBox, BleedBox and MediaBox. Therefore you need to "normalize" the *Box representation inside the PDF source code if you want to edit it.
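To illustrate why the many spellings matter, here is a minimal Python sketch of a pattern that accepts all of the variants listed above. The names `CROPBOX_RE` and `find_cropboxes` are made up for this example, and the sketch assumes the box is written as a literal array of four numbers in uncompressed PDF source (it will not see an entry stored inside a compressed object stream or given as an indirect reference).

```python
import re

# One PDF number: optional sign, digits, optional fraction (no exponents).
NUMBER = rb"[+-]?(?:\d+\.?\d*|\.\d+)"

# Hypothetical pattern: "/CropBox", optional whitespace (including newlines),
# then an array of exactly four whitespace-separated numbers.
CROPBOX_RE = re.compile(
    rb"/CropBox\s*\[\s*"
    + rb"\s+".join(rb"(" + NUMBER + rb")" for _ in range(4))
    + rb"\s*\]"
)


def find_cropboxes(pdf_bytes: bytes) -> list[tuple[float, ...]]:
    """Return every literal CropBox array found, as tuples of four floats."""
    return [
        tuple(float(group) for group in match.groups())
        for match in CROPBOX_RE.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    print(find_cropboxes(b"/CropBox [ 10.00 20.0000 500.0 700 ]"))
```

Normalizing the file first, as described in the next step, is still the more robust route, because it also unpacks compressed parts of the file that a plain pattern search would miss.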

First Step: "Normalize" the PDF source code

Here is how you do that:

  1. Download qpdf for your OS platform.
  2. Run this command on your input PDF:
    qpdf --qdf input.pdf output.pdf

The output.pdf now will have a kind of normalized structure (similar to the last example given above), and it will be easier to edit, even with a stream editor like sed.

Second Step: Remove all superfluous *Box statements

Next, you need to know that the only essential *Box is MediaBox. This one MUST be present; the others are optional (in a certain prioritized way). If CropBox is missing, it defaults to the MediaBox values, and the remaining boxes default to CropBox, so with all of them absent every box equals MediaBox. Therefore, in order to achieve your goal, we can simply delete all code that is related to them. We'll do it with the help of sed.

That tool is normally installed on Linux systems; on Windows, download and install it from gnuwin32.sf.net. (Don't forget to install the named "dependencies" should you decide to use the .zip file instead of the Setup .exe.)

Now run this command:

  1. sed.exe -i.bak -e "/CropBox/,/]/s#.# #g" output.pdf

Here is what this command is supposed to do:

  • -i.bak tells sed to edit the original file in place, but to also create a backup file with a .bak suffix (in case something goes wrong).
  • /CropBox/ states the first address line to be processed by sed.
  • /]/ states the last address line to be processed by sed.
  • s tells sed to do substitutions on all lines from the first to the last addressed line.
  • #.# #g tells sed which kind of substitution to do: replace every character ('.') in the addressed lines with a blank (' '), globally ('g').

We replace the characters with blanks (instead of with 'nothing', i.e. deleting them) because otherwise we'd get complaints about "PDF file corruption": deleting bytes would shift the byte offsets recorded in the file's cross-reference table and could change stream lengths.
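For readers without sed, the same length-preserving idea can be sketched in Python. This is an illustration, not the answer's method: `blank_cropboxes` is a hypothetical helper, and unlike the sed range (which blanks whole lines from the one containing CropBox to the one containing `]`) it blanks only the `/CropBox [...]` entry itself. It assumes the entry is a literal array in uncompressed source, which is what the qpdf step is meant to guarantee.

```python
import re

# "/CropBox", optional whitespace, then everything up to the closing bracket.
CROPBOX_ARRAY = re.compile(rb"/CropBox\s*\[[^\]]*\]")


def blank_cropboxes(pdf_bytes: bytes) -> bytes:
    """Overwrite every /CropBox entry with spaces, keeping the byte count.

    Line breaks inside the entry are left alone, so the file keeps the same
    length and the same line structure as before.
    """
    def blank(match: re.Match) -> bytes:
        return re.sub(rb"[^\r\n]", b" ", match.group(0))

    return CROPBOX_ARRAY.sub(blank, pdf_bytes)


if __name__ == "__main__":
    sample = b"<<\n  /CropBox [\n    10\n    20\n    500\n    700\n  ]\n>>\n"
    result = blank_cropboxes(sample)
    print(len(sample) == len(result))  # same size, so offsets are unchanged
```

Because the result has exactly as many bytes as the input, everything that follows the entry stays at its original position in the file.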

Third Step: Run your Ghostscript command

You already know this one well enough:

gswin32c.exe -sDEVICE=png16m -o outputImage_%03d.png output.pdf

All three steps above can easily be scripted, which I'll leave to you for your own pleasure.
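One possible way to script the three steps is sketched below in Python. It is only a sketch under a few assumptions: `qpdf` and `gswin32c.exe` are on the PATH, the command lines are exactly the ones quoted above, and the sed call is swapped for a small in-process replacement (the hypothetical `blank_cropboxes`) so that sed does not need to be installed. All function names are invented for this example.

```python
import re
import subprocess
from pathlib import Path

CROPBOX_ARRAY = re.compile(rb"/CropBox\s*\[[^\]]*\]")


def qpdf_command(src: str, dst: str) -> list[str]:
    """Step 1: normalize the PDF source with qpdf's QDF mode."""
    return ["qpdf", "--qdf", src, dst]


def blank_cropboxes(pdf_bytes: bytes) -> bytes:
    """Step 2: overwrite each /CropBox entry with spaces (same byte count)."""
    return CROPBOX_ARRAY.sub(
        lambda m: re.sub(rb"[^\r\n]", b" ", m.group(0)), pdf_bytes
    )


def ghostscript_command(pdf: str, pattern: str = "outputImage_%03d.png",
                        exe: str = "gswin32c.exe") -> list[str]:
    """Step 3: render every page to PNG with the png16m device."""
    return [exe, "-sDEVICE=png16m", "-o", pattern, pdf]


def render_full_mediabox(src: str, work: str = "output.pdf") -> None:
    """Run all three steps; raises CalledProcessError if a tool fails."""
    subprocess.run(qpdf_command(src, work), check=True)
    work_path = Path(work)
    work_path.write_bytes(blank_cropboxes(work_path.read_bytes()))
    subprocess.run(ghostscript_command(work), check=True)


if __name__ == "__main__":
    render_full_mediabox("input.pdf")
```

On Linux the executable name would need to be passed in through the `exe` parameter; the default here simply mirrors the Windows command used in the question.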

老街孤人 2024-11-24 01:40:33

First, let's get rid of a misunderstanding. You wrote:

"This is working fine when the pdf crop box is the same as the media box, but if the crop box is smaller than the media box, only the media box is displayed and the border of the pdf page is lost."

That's not correct. If the CropBox is smaller than the MediaBox, then only the CropBox should be displayed (not the MediaBox). And that is exactly how it was designed to work. This is the whole idea behind the CropBox concept...


At the moment I cannot think of a solution that works automatically for every PDF and all possible values that can be there (unless you want to use payware).

To manually process the PDF you linked to:

  1. Open the PDF in a good text editor (one that doesn't mess with existing EOL conventions, and doesn't complain about binary parts in the file).
  2. Search for all spots in the file that contain the /CropBox keyword.
  3. Since you have only one page in the PDF, it should find only one spot.
  4. This could read like /CropBox [12.3456 78.9012 345.67 890.123456].
  5. Now edit this part, taking care not to change the number of characters that are already there:
  6. Set the value to the one you want: /CropBox [0.00000 0.00000 667.00 908.000000]. (You can use spaces instead of my .0000.. parts, but if I do, the SO editor will eat them and you'll not see what I originally typed...)
  7. Save the file under a new name.
  8. A PDF viewer should now show the full MediaBox (as you specified).
  9. When you convert the new file with Ghostscript to PNG, the bigger page will be visible.
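The manual edit above can also be sketched as a small Python function. This is an illustration only: `set_cropbox` and `format_number` are hypothetical names, the sketch assumes the CropBox is a literal array in uncompressed PDF source, and it pads with spaces (as the answer suggests) so the entry keeps its original length.

```python
import re

CROPBOX_ARRAY = re.compile(rb"/CropBox\s*\[[^\]]*\]")


def format_number(value: float) -> str:
    """Format without exponent notation, e.g. 667.0 -> '667', 12.5 -> '12.5'."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def set_cropbox(pdf_bytes: bytes,
                box: tuple[float, float, float, float]) -> bytes:
    """Rewrite every /CropBox entry to `box`, padded to the old length."""
    values = " ".join(format_number(v) for v in box).encode("ascii")

    def replace(match: re.Match) -> bytes:
        old = match.group(0)
        # Bytes left over once the prefix, the values and "]" are written.
        pad = len(old) - len(b"/CropBox [") - len(values) - 1
        if pad < 0:
            raise ValueError("new CropBox does not fit in the existing entry")
        return b"/CropBox [" + values + b" " * pad + b"]"

    return CROPBOX_ARRAY.sub(replace, pdf_bytes)


if __name__ == "__main__":
    page = b"<< /CropBox [12.3456 78.9012 345.67 890.123456] >>"
    print(set_cropbox(page, (0, 0, 667, 908)))
```

If the new values need more characters than the old entry provides, the function refuses rather than growing the file, since a longer entry would change the file's length and defeat the point of the careful manual edit.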