How do I know if a PDF page is color or black-and-white?

Posted on 2024-07-15 09:49:29

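Given a set of PDF files in which some pages are color and the rest are black and white, is there a program that can find out which of the pages are color and which are black and white? This would be useful, for instance, when printing a thesis and paying extra only for the color pages. Bonus points for someone who takes double-sided printing into account and sends a black-and-white page to the color printer when it shares a sheet with a color page on the opposite side.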

弃爱 2024-07-22 09:49:29

This is one of the most interesting questions I've seen! I agree with some of the other posts that rendering to a bitmap and then analyzing the bitmap will be the most reliable solution. For simple PDFs, here's a faster but less complete approach.

1. Parse each PDF page
2. Look for color directives (g, rg, k, sc, scn, etc)
3. Look for embedded images, analyze for color

My solution below does #1 and half of #2. The other half of #2 would be to follow up with user-defined color, which involves looking up the /ColorSpace entries in the page and decoding them -- contact me offline if this is interesting to you, as it's very doable but not in 5 minutes.

First the main program:

use CAM::PDF;

my $infile = shift;
my $pdf = CAM::PDF->new($infile);
PAGE:
for my $p (1 .. $pdf->numPages) {
   my $tree = $pdf->getPageContentTree($p);
   if (!$tree) {
      print "Failed to parse page $p\n";
      next PAGE;
   }
   my $colors = $tree->traverse('My::Renderer::FindColors')->{colors};
   my $uncertain = 0;
   for my $color (@{$colors}) {
      my ($name, @rest) = @{$color};
      if ($name eq 'g') {
      } elsif ($name eq 'rgb') {
         my ($r, $g, $b) = @rest;
         if ($r != $g || $r != $b) {
            print "Page $p is color\n";
            next PAGE;
         }
      } elsif ($name eq 'cmyk') {
         my ($c, $m, $y, $k) = @rest;
         if ($c != 0 || $m != 0 || $y != 0) {
            print "Page $p is color\n";
            next PAGE;
         }
      } else {
         $uncertain = $name;
      }
   }
   if ($uncertain) {
      print "Page $p has user-defined color ($uncertain), needs more investigation\n";
   } else {
      print "Page $p is grayscale\n";
   }
}

And then here's the helper renderer that handles color directives on each page:

package My::Renderer::FindColors;

sub new {
   my $pkg = shift;
   return bless { colors => [] }, $pkg;
}
sub clone {
   my $self = shift;
   my $pkg = ref $self;
   return bless { colors => $self->{colors}, cs => $self->{cs}, CS => $self->{CS} }, $pkg;
}
sub rg {
   my ($self, $r, $g, $b) = @_;
   push @{$self->{colors}}, ['rgb', $r, $g, $b];
}
sub g {
   my ($self, $gray) = @_;
   push @{$self->{colors}}, ['rgb', $gray, $gray, $gray];
}
sub k {
   my ($self, $c, $m, $y, $k) = @_;
   push @{$self->{colors}}, ['cmyk', $c, $m, $y, $k];
}
# cs selects the non-stroking (fill) color space
sub cs {
   my ($self, $name) = @_;
   $self->{cs} = $name;
}
# CS selects the stroking color space
sub CS {
   my ($self, $name) = @_;
   $self->{CS} = $name;
}
sub _sc {
   my ($self, $cs, @rest) = @_;
   return if !$cs; # syntax error
   if ($cs eq 'DeviceRGB') { $self->rg(@rest); }
   elsif ($cs eq 'DeviceGray') { $self->g(@rest); }
   elsif ($cs eq 'DeviceCMYK') { $self->k(@rest); }
   else { push @{$self->{colors}}, [$cs, @rest]; }
}
sub sc {
   my ($self, @rest) = @_;
   $self->_sc($self->{cs}, @rest);
}
sub SC {
   my ($self, @rest) = @_;
   $self->_sc($self->{CS}, @rest);
}
sub scn { sc(@_); }
sub SCN { SC(@_); }
sub RG { rg(@_); }
sub G { g(@_); }
sub K { k(@_); }
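
The per-operator test used in the main program can also be illustrated without CAM::PDF. The following is only a rough sketch, not a replacement for the Perl code above: it assumes the content stream has already been decoded to plain text, that each operator sits on the same line as its operands, and it ignores sc/scn and embedded images entirely. The function name stream_has_color is made up for this example.

```shell
# stream_has_color (hypothetical helper)
# Reads decoded PDF content-stream text on stdin.
# Exits 0 if an rg/RG or k/K operator sets a non-gray color, 1 otherwise.
stream_has_color() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # "r g b rg": gray only when all three components are equal.
            if (($i == "rg" || $i == "RG") && i > 3 &&
                ($(i-3) + 0 != $(i-2) + 0 || $(i-3) + 0 != $(i-1) + 0)) found = 1
            # "c m y k k": gray only when C, M and Y are all zero.
            if (($i == "k" || $i == "K") && i > 4 &&
                ($(i-4) + 0 != 0 || $(i-3) + 0 != 0 || $(i-2) + 0 != 0)) found = 1
        }
    }
    END { exit (found ? 0 : 1) }'
}
```

For example, `printf '1 0 0 rg\n' | stream_has_color && echo color` reports color, while a stream containing only `0.5 0.5 0.5 rg` or `0 0 0 1 k` does not.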

走过海棠暮 2024-07-22 09:49:29

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100% (see Detecting all pages which contain color).

For example:

$ gs -q -o - -sDEVICE=inkcov file.pdf 
0.11264  0.11605  0.11605  0.09364 CMYK OK
0.11260  0.11601  0.11601  0.09360 CMYK OK

If any of the C, M, or Y values is nonzero, the page contains color.

To output only the pages that contain color, use this handy one-liner:

$ gs -o - -sDEVICE=inkcov file.pdf |tail -n +4 |sed '/^Page*/N;s/\n//'|sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000  / d'
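
The same rule can be written as a small filter over the quiet (-q) output. This is a minimal sketch, assuming one "C M Y K CMYK OK" line per page in page order; the function name classify_inkcov is made up for this example, and lines that do not look like inkcov results are skipped.

```shell
# classify_inkcov (hypothetical helper)
# Reads `gs -q -o - -sDEVICE=inkcov file.pdf` output on stdin, one line per page.
# Prints "<page> color" or "<page> gray"; page numbers are inferred from line order.
classify_inkcov() {
    awk '$5 == "CMYK" {
        page++
        # $1..$3 are the C, M and Y coverage fractions; $4 (K) is ignored.
        if ($1 + 0 > 0 || $2 + 0 > 0 || $3 + 0 > 0) print page, "color"
        else print page, "gray"
    }'
}

# Usage (assumes Ghostscript is installed):
#   gs -q -o - -sDEVICE=inkcov file.pdf | classify_inkcov
```

Comparing numerically rather than matching the literal string 0.00000 keeps the check independent of the exact column spacing.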

会发光的星星闪亮亮i 2024-07-22 09:49:29

It is possible to use the ImageMagick tool identify. When used on a PDF page, it first converts the page to a raster image. Whether the page contains color can then be tested with the -format "%[colorspace]" option, which for my PDF printed either Gray or RGB. IMHO identify (or whatever tool it uses in the background; Ghostscript?) chooses the colorspace depending on the presence of color.

An example is:

identify -format "%[colorspace]" $FILE.pdf[$PAGE]

where PAGE is the zero-based page index (the first page is 0, not 1). Without a page selection, all pages are collapsed into one, which is not what you want.

I wrote the following Bash script, which uses pdfinfo to get the number of pages and then loops over them, printing the pages that are in color. I also added a feature for double-sided documents, where you might need the non-colored page on the back of the sheet as well.

Using the space-separated list that the script prints, the color pages can be extracted with pdftk:

pdftk $FILE cat $PAGELIST output color_${FILE}.pdf

#!/bin/bash

FILE="$1"
# Quote the file name so that paths containing spaces still work
PAGES=$(pdfinfo "${FILE}" | grep 'Pages:' | sed 's/Pages:\s*//')

GRAYPAGES=""
COLORPAGES=""
DOUBLECOLORPAGES=""

echo "Pages: $PAGES"
N=1
while (test "$N" -le "$PAGES")
do
    COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
    echo "$N: $COLORSPACE"
    if [[ $COLORSPACE == "Gray" ]]
    then
        GRAYPAGES="$GRAYPAGES $N"
    else
        COLORPAGES="$COLORPAGES $N"
        # For double sided documents also list the page on the other side of the sheet:
        if [[ $((N%2)) -eq 1 ]]
        then
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $N $((N+1))"
            #N=$((N+1))
        else
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $((N-1)) $N"
        fi
    fi
    N=$((N+1))
done

echo $DOUBLECOLORPAGES
echo $COLORPAGES
echo $GRAYPAGES
#pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf

海之角 2024-07-22 09:49:29

The script from Martin Scharrer is great. It contains a minor bug: two directly consecutive pages that both contain color are counted twice. I fixed that. In addition, the script now counts the pages and lists the grayscale pages for double-sided printing. It also prints the pages comma-separated, so the output can be used directly for printing from a PDF viewer. I've added the code, but you can download it here, too.

Cheers,
timeshift

#!/bin/bash

if [ $# -ne 1 ]
then
    echo "USAGE: This script needs exactly one parameter: the path to the PDF"
    exit 1
fi

FILE="$1"
# Quote the file name so that paths containing spaces still work
PAGES=$(pdfinfo "${FILE}" | grep 'Pages:' | sed 's/Pages:\s*//')

GRAYPAGES=""
COLORPAGES=""
DOUBLECOLORPAGES=""
DOUBLEGRAYPAGES=""
OLDGP=""
DOUBLEPAGE=0
DPGC=0
DPCC=0
SPGC=0
SPCC=0

echo "Pages: $PAGES"
N=1
while (test "$N" -le "$PAGES")
do
    COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
    echo "$N: $COLORSPACE"
    if [[ $DOUBLEPAGE -eq -1 ]]
    then
        DOUBLEGRAYPAGES="$OLDGP"
        DPGC=$((DPGC-1))
        DOUBLEPAGE=0
    fi
    if [[ $COLORSPACE == "Gray" ]]
    then
        GRAYPAGES="$GRAYPAGES,$N"
        SPGC=$((SPGC+1))
        if [[ $DOUBLEPAGE -eq 0 ]]
        then
            OLDGP="$DOUBLEGRAYPAGES"
            DOUBLEGRAYPAGES="$DOUBLEGRAYPAGES,$N"
            DPGC=$((DPGC+1))
        else
            DOUBLEPAGE=0
        fi
    else
        COLORPAGES="$COLORPAGES,$N"
        SPCC=$((SPCC+1))
        # For double sided documents also list the page on the other side of the sheet:
        if [[ $((N%2)) -eq 1 ]]
        then
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES,$N,$((N+1))"
            DOUBLEPAGE=$((N+1))
            DPCC=$((DPCC+2))
            #N=$((N+1))
        else
            if [[ $DOUBLEPAGE -eq 0 ]]
            then
                DOUBLECOLORPAGES="$DOUBLECOLORPAGES,$((N-1)),$N"
                DPCC=$((DPCC+2))
                DOUBLEPAGE=-1
            elif [[ $DOUBLEPAGE -gt 0 ]]
            then
                DOUBLEPAGE=0
            fi
        fi
    fi
    N=$((N+1))
done

echo " "
echo "Double-paged printing:"
echo "  Color($DPCC): ${DOUBLECOLORPAGES:1:${#DOUBLECOLORPAGES}-1}"
echo "  Gray($DPGC): ${DOUBLEGRAYPAGES:1:${#DOUBLEGRAYPAGES}-1}"
echo " "
echo "Single-paged printing:"
echo "  Color($SPCC): ${COLORPAGES:1:${#COLORPAGES}-1}"
echo "  Gray($SPGC): ${GRAYPAGES:1:${#GRAYPAGES}-1}"
#pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf
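
The duplex bookkeeping can also be approached by thinking in sheets rather than tracking state between pages. The sketch below is an alternative illustration, not the script above: it assumes sheets hold pages (1,2), (3,4), and so on, takes the total page count and the list of color pages as arguments, and relies on sort -u to drop the duplicates that arise when both sides of a sheet are color. The function name duplex_color_pages is made up for this example.

```shell
# duplex_color_pages TOTAL PAGE... (hypothetical helper)
# Prints, comma separated, every page that lies on a sheet containing a color page.
duplex_color_pages() {
    total=$1
    shift
    for p in "$@"; do
        front=$(( (p - 1) / 2 * 2 + 1 ))   # odd page on the same sheet
        echo "$front"
        back=$(( front + 1 ))
        # A trailing odd page has no back side.
        if [ "$back" -le "$total" ]; then echo "$back"; fi
    done | sort -n -u | paste -s -d, -
}
```

For a 7-page document with color on pages 1, 2, 4 and 7, `duplex_color_pages 7 1 2 4 7` prints `1,2,3,4,7`.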

囚我心虐我身 2024-07-22 09:49:29

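ImageMagick has some built-in methods for image comparison.

http://www.imagemagick.org/Usage/compare/#type_general

There are some Perl APIs for ImageMagick, so if you cleverly combine these with a PDF-to-image converter, you may find a way to do your black-and-white test.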

忘你却要生生世世 2024-07-22 09:49:29

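I would try it like this, although there may be other, easier solutions, and I'm curious to hear them. I just want to give it a try:

1. Loop through all pages
2. Extract each page to an image
3. Check the color range of the image

For the page count, you can probably translate that to Perl without too much effort. It's basically a regex. It's also said that:

r"(/Type)\s?(/Page)[/>\s]"
You simply have to count how many
times this regular expression occurs
in the PDF file, minus the times you
find the string "<>"
(empty pages which are not rendered).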
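
The counting step quoted above might look like this with grep. This is a sketch under several assumptions: the page objects are stored uncompressed in the file (PDFs that use object streams will be undercounted), grep supports the -a option (GNU and BSD grep do), and a /Type /Page entry that ends exactly at a line break is missed because grep works line by line. The subtraction of empty pages is not included, and the function name count_pdf_pages is made up for this example.

```shell
# count_pdf_pages FILE (hypothetical helper)
# Counts "/Type /Page" dictionary entries in the raw bytes of FILE.
count_pdf_pages() {
    # LC_ALL=C and -a make grep treat the binary PDF as plain bytes;
    # -o prints every match on its own line so that wc -l counts matches.
    LC_ALL=C grep -a -o -E '/Type[[:space:]]?/Page[/>[:space:]]' "$1" | wc -l | tr -d ' '
}
```

The trailing character class is what keeps the /Type /Pages tree node from being counted as a page.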

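To extract the image, you can use ImageMagick. Or see this question.

Finally, whether a page counts as black and white depends on whether you mean literally black and white or grayscale. For black and white, the whole image should contain only, well, black and white. Grayscale is really not my specialty, but I guess you could check whether the red, green, and blue values are close to each other, or whether the original image and a grayscale-converted copy are close to each other.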
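
The "channels close to each other" idea can be sketched as a per-pixel check rather than a comparison of averages, since averages can hide a small colored region. This is an illustration only: it assumes the pixels have already been dumped as plain "R G B" integer lines (how to produce that dump from a rendered page is left open), and the function name has_color and its tolerance argument are made up for this example.

```shell
# has_color [TOLERANCE] (hypothetical helper)
# Reads one pixel per line on stdin as "R G B" (integers, e.g. 0-255).
# Exits 0 if some pixel's channels differ by more than TOLERANCE (default 0),
# exits 1 if every pixel is gray within that tolerance.
has_color() {
    awk -v tol="${1:-0}" '
        NF >= 3 {
            max = $1 + 0; min = $1 + 0
            if ($2 + 0 > max) max = $2 + 0
            if ($2 + 0 < min) min = $2 + 0
            if ($3 + 0 > max) max = $3 + 0
            if ($3 + 0 < min) min = $3 + 0
            # Stop at the first pixel whose channel spread exceeds the tolerance.
            if (max - min > tol) { found = 1; exit }
        }
        END { exit (found ? 0 : 1) }'
}
```

A small nonzero tolerance is useful for scanned pages, where gray pixels rarely have exactly equal channels.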

Hope this gives you some hints to help you go further.

秋风の叶未落 2024-07-22 09:49:29

Here is a Ghostscript solution for Windows. It requires grep from GnuWin (http://gnuwin32.sourceforge.net/packages/grep.htm):

Monochrome (black and white) pages:

gswin64c -q -o - -sDEVICE=inkcov DOCUMENT.pdf | grep "^ 0.00000  0.00000  0.00000" | find /c /v ""

Color pages:

gswin64c -q -o - -sDEVICE=inkcov DOCUMENT.pdf | grep -v "^ 0.00000  0.00000  0.00000" | find /c /v ""

Total pages (you can get this more easily from any PDF reader):

gswin64c -q -o - -sDEVICE=inkcov DOCUMENT.pdf | find /c /v ""

可爱暴击 2024-07-22 09:49:29

Here is an improved Bash one-liner to detect colour pages based on Matteo's answer, giving only the page numbers on a single line:

gs -q -o - -sDEVICE=inkcov file.pdf | grep -vn "^ 0.00000  0.00000  0.00000" | cut -d ':' -f 1 | tr '\n' ' ' && echo

His original answer does not work for some complex PDFs, because gs without a -q option is chatty and will on some pages output irrelevant text such as "Loading D050000L font from /usr/share/ghostscript/9.52/Resource/Font/D050000L...". With -q, gs will not output page numbers, but that's fine because gs will go over all pages in order anyway.

In this answer, grep finds all lines that do not start with all zeroes and adds a line (=page) number, and cut selects only the page numbers. That's all you need if you want a vertical list of page numbers. The additional tr replaces the line breaks with spaces, and the extra echo properly ends the line with a line break.
