How do I know if a PDF page is color or black and white?
Given a set of PDF files in which some pages are color and the rest are black & white, is there any program that can tell which pages are color and which are black & white? This would be useful, for instance, when printing a thesis and only paying extra to print the color pages. Bonus points for a solution that takes double-sided printing into account and sends a black & white page to the color printer if it is followed by a color page on the opposite side.
This is one of the most interesting questions I've seen! I agree with some of the other posts that rendering to a bitmap and then analyzing the bitmap will be the most reliable solution. For simple PDFs, here's a faster but less complete approach.
My solution below does #1 and half of #2. The other half of #2 would be to follow up with user-defined color, which involves looking up the /ColorSpace entries in the page and decoding them -- contact me offline if this is interesting to you, as it's very doable but not in 5 minutes.
First the main program:
And then here's the helper renderer that handles color directives on each page:
Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100% (see Detecting all pages which contain color).
For example:
If the CMY values are not 0 then the page is color.
To just output the pages that contain colors use this handy oneliner:
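A minimal sketch of such a filter, assuming that with -q the inkcov device prints one line per page, in page order, of the form "C M Y K CMYK OK". The function name inkcov_color_pages and the file name DOCUMENT.pdf are placeholders, not part of the original answer. It applies the rule stated above: a page counts as color when any of its C, M or Y values is non-zero.

```shell
#!/bin/sh
# List the 1-based numbers of pages whose C, M or Y coverage is non-zero.
# Reads inkcov output on stdin; every line whose fifth field is "CMYK"
# is taken to describe the next page (an assumption about the format).
inkcov_color_pages() {
    awk '$5 == "CMYK" {
             page++
             if ($1 + $2 + $3 > 0) print page
         }'
}

# Usage (DOCUMENT.pdf is a placeholder file name):
#   gs -q -o - -sDEVICE=inkcov DOCUMENT.pdf | inkcov_color_pages
```

Parsing by fields rather than by a fixed string keeps the filter independent of how many spaces Ghostscript puts between the values.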
It is possible to use the ImageMagick tool identify. If used on PDF pages, it first converts the page to a raster image. Whether the page contains color can be tested using the -format "%[colorspace]" option, which for my PDF printed either Gray or RGB. IMHO identify (or whatever tool it uses in the background; Ghostscript?) chooses the colorspace depending on the presence of color. An example is:
where PAGE is the page starting from 0, not 1. If the page selection is not used all pages will be collapsed to one, which is not what you want.
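A small sketch of that check, assuming identify is installed and prints Gray for pages without color. The helper names is_color_space and page_colorspace and the file name thesis.pdf are made up for illustration. Depending on the ImageMagick version the color result may read RGB or sRGB, so the sketch treats anything other than Gray as color.

```shell
#!/bin/sh
# Interpret the string printed by identify's %[colorspace] escape.
# Assumption: "Gray" means no color was found; anything else
# (RGB, sRGB, CMYK, or an empty string after an error) counts as color.
is_color_space() {
    case "$1" in
        Gray*|gray*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the colorspace of one page. $1 is the PDF file, $2 the page
# index, which is zero-based as described above.
page_colorspace() {
    identify -format '%[colorspace]' "${1}[${2}]"
}

# Usage (thesis.pdf is a placeholder file name):
#   if is_color_space "$(page_colorspace thesis.pdf 0)"; then
#       echo "page 1 contains color"
#   fi
```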
I wrote the following Bash script, which uses pdfinfo to get the number of pages and then loops over them, outputting the pages that are in color. I also added a feature for double-sided documents, where you might need a non-colored backside page as well.

Using the resulting space-separated list, the colored PDF pages can be extracted using pdftk:
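A rough sketch of such a loop and of the pdftk call, not the author's actual script. It assumes pdfinfo prints a line of the form "Pages: N" and that identify reports Gray for pages without color. The function names and file names are placeholders.

```shell
#!/bin/sh
# Extract the page count from pdfinfo output read on stdin
# (pdfinfo is assumed to print a line such as "Pages:   12").
pdfinfo_page_count() {
    awk '$1 == "Pages:" { print $2; exit }'
}

# Print the 1-based numbers of all color pages of a PDF, space separated.
color_pages_identify() {
    file=$1
    total=$(pdfinfo "$file" | pdfinfo_page_count)
    page=0
    list=""
    while [ "$page" -lt "$total" ]; do
        # identify expects a zero-based page index
        space=$(identify -format '%[colorspace]' "${file}[${page}]")
        case "$space" in
            Gray*) ;;                            # no color on this page
            *) list="$list $((page + 1))" ;;     # remember the 1-based number
        esac
        page=$((page + 1))
    done
    echo "${list# }"
}

# Usage (file names are placeholders; $pages is left unquoted on purpose
# so that each page number becomes its own pdftk argument):
#   pages=$(color_pages_identify thesis.pdf)
#   pdftk thesis.pdf cat $pages output thesis-color.pdf
```

Rendering every page separately is slow for long documents, which is the price of letting ImageMagick decide what counts as color.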
The script from Martin Scharrer is great. It contains a minor bug: two directly consecutive color pages are counted twice. I fixed that. In addition, the script now counts the pages and lists the grayscale pages for double-sided printing. It also prints the pages comma-separated, so the output can be used directly for printing from a PDF viewer. I've added the code, but you can download it here, too.
Cheers,
timeshift
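The double-sided handling can be sketched as follows. This is an illustration of the idea, not timeshift's script: it assumes sheets are filled in order, so pages 2k-1 and 2k share a sheet, and that the color page numbers arrive in ascending order. The function name duplex_pages is hypothetical.

```shell
#!/bin/sh
# Expand a list of color pages (one number per line on stdin, ascending)
# to whole duplex sheets: if either side of a sheet is color, both sides
# go to the color printer. $1 is the total page count, so that an odd
# last page does not get a non-existent back side.
# Output: comma-separated page numbers, each listed once.
duplex_pages() {
    awk -v total="$1" '
        {
            front = ($1 % 2) ? $1 : $1 - 1      # odd-numbered side of the sheet
            if (front != last) {                # skip a sheet already listed
                out = out (out == "" ? "" : ",") front
                if (front + 1 <= total) out = out "," (front + 1)
                last = front
            }
        }
        END { print out }'
}

# Usage: printf '3\n4\n7\n' | duplex_pages 7
```

Tracking the last sheet that was emitted is what keeps two consecutive color pages on the same sheet from appearing twice in the list.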
ImageMagick has some built-in methods for image comparison.
http://www.imagemagick.org/Usage/compare/#type_general
There are some Perl APIs for ImageMagick, so maybe if you cleverly combine these with a PDF to Image converter you can find a way to do your black & white test.
I would try to do it as follows. There might be other, easier solutions, and I'm curious to hear them; I just want to give it a try:
For the page count, you can probably translate that without too much effort to Perl. It's basically a regex. It's also said that:
To extract the image, you can use ImageMagick to do that. Or see this question.
Finally, to get whether it is black and white, it depends if you mean literally black and white or grayscale. For black and white, you should only have, well, black and white in all the image. If you want to see grayscale, now, it's really not my speciality but I guess you could see if the averages of the red, the green and the blue are close to each other or if the original image and a grayscale converted one are close to each other.
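The "averages are close to each other" idea could be sketched like this. The function name channels_close and the tolerance are made up for illustration, and the identify call in the usage comment assumes that ImageMagick's fx escapes mean.r, mean.g and mean.b return the per-channel means on a 0..1 scale. Equal channel means do not prove a page is gray (red, green and blue patches of equal area would also pass), so treat it as a rough first filter only.

```shell
#!/bin/sh
# Succeed if the three channel means differ by no more than a tolerance.
# Arguments: R G B TOL, all numbers on the same scale (for example 0..1).
channels_close() {
    awk -v r="$1" -v g="$2" -v b="$3" -v tol="$4" 'BEGIN {
        max = r; if (g > max) max = g; if (b > max) max = b
        min = r; if (g < min) min = g; if (b < min) min = b
        exit !((max - min) <= tol)      # exit status 0 means "close enough"
    }'
}

# Usage (page.png is a placeholder for an already rendered page; $means is
# left unquoted on purpose so it splits into three arguments):
#   means=$(identify -format '%[fx:mean.r] %[fx:mean.g] %[fx:mean.b]' page.png)
#   if channels_close $means 0.01; then echo "probably grayscale"; fi
```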
Hope it gives some hints to help you go further.
Here is the ghostscript solution for Windows, which requires grep from GnuWin (http://gnuwin32.sourceforge.net/packages/grep.htm):
Monochrome (Black and White) pages:
gswin64c -q -o - -sDEVICE=inkcov DOCUMENT.pdf | grep "^ *0.00000  *0.00000  *0.00000" | find /c /v ""
Color pages:
gswin64c -q -o - -sDEVICE=inkcov DOCUMENT.pdf | grep -v "^ *0.00000  *0.00000  *0.00000" | find /c /v ""
Total pages (you get this one easier from any pdf reader):
gswin64c -q -o - -sDEVICE=inkcov DOCUMENT.pdf | find /c /v ""
Here is an improved Bash one-liner to detect colour pages based on Matteo's answer, giving only the page numbers on a single line:
His original answer does not work for some complex PDFs, because gs without the -q option is chatty and will, on some pages, output irrelevant text such as "Loading D050000L font from /usr/share/ghostscript/9.52/Resource/Font/D050000L...". With -q, gs will not output page numbers, but that's fine because gs goes over all pages in order anyway.

In this answer, grep finds all lines that do not start with all zeroes and adds a line (= page) number, and cut selects only the page numbers. That's all you need if you want a vertical list of page numbers. The additional tr replaces the line breaks with spaces, and the extra echo properly ends the line with a line break.
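The pipeline described above can be sketched as follows. This is a reconstruction from the description rather than the original one-liner. The function name color_pages_line and the file name DOCUMENT.pdf are placeholders, and the pattern allows any number of spaces between the values, since the exact spacing of the inkcov output is an assumption.

```shell
#!/bin/sh
# Read `gs -q -o - -sDEVICE=inkcov` output on stdin and print the numbers
# of the color pages on a single line.
color_pages_line() {
    grep -n -v '^ *0\.00000  *0\.00000  *0\.00000 ' |   # keep non-zero CMY lines, prefixed "N:"
        cut -d: -f1 |                                    # keep only N, the line (= page) number
        tr '\n' ' '                                      # join the numbers with spaces
    echo                                                 # end the output with a line break
}

# Usage (DOCUMENT.pdf is a placeholder file name):
#   gs -q -o - -sDEVICE=inkcov DOCUMENT.pdf | color_pages_line
```

Because grep -n numbers the input lines, not the matches, the numbers stay aligned with the pages as long as gs prints exactly one line per page, which is why -q matters here.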