Detecting scans of blank pages with PIL

Posted 2024-10-26 13:36:17 · 387 words · 6 views · 0 comments

So I often run huge double-sided scan jobs on an unintelligent Canon multifunction, which leaves me with a huge folder of JPEGs. Am I insane to consider using PIL to analyze a folder of images to detect scans of blank pages and flag them for deletion?

Leaving the folder-crawling and flagging parts out, I imagine this would look something like:

  • Check if the image is greyscale, as this is presumed uncertain.
  • If so, detect the dominant range of shades (background colour).
  • If not, detect the dominant range of shades, restricting to light greys.
  • Determine what percentage of the entire image is composed of said shades.
  • Try to find a threshold that adequately detects pages with type or writing or imagery.
  • Perhaps test fragments of the image at a time to increase accuracy of threshold.
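
The steps above might be sketched roughly as follows with Pillow (the maintained PIL fork). The function names, the `tolerance` of 16 grey levels and the `min_fraction` of 0.995 are illustrative assumptions, not tested values; converting to mode `"L"` sidesteps the greyscale-or-colour question by treating every scan as greyscale.

```python
from PIL import Image


def dominant_shade_fraction(img, tolerance=16):
    """Fraction of pixels within `tolerance` grey levels of the most common shade."""
    grey = img.convert("L")              # works for colour and greyscale input alike
    hist = grey.histogram()              # 256 pixel counts, one per grey level
    dominant = max(range(256), key=hist.__getitem__)
    lo = max(0, dominant - tolerance)
    hi = min(255, dominant + tolerance)
    return sum(hist[lo:hi + 1]) / sum(hist)


def looks_blank(img, tolerance=16, min_fraction=0.995):
    """Guess that a page is blank when nearly all pixels share the background shade."""
    # Both defaults are assumptions; tune them on a handful of real scans.
    return dominant_shade_fraction(img, tolerance) >= min_fraction
```

A call such as `looks_blank(Image.open("scan_0001.jpg"))` (hypothetical file name) would then return `True` or `False` for a single page.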

I know this is sort of an edge case, but can anyone with PIL experience lend some pointers?

Comments (3)

何以心动 2024-11-02 13:36:17

Here is an alternative solution, using mahotas and milk.

  1. Start by creating two directories, positives/ and negatives/, where you will manually pick out a few examples.
  2. I will assume that the rest of the data is in an unlabeled/ directory.
  3. Compute features for all of the images in positives/ and negatives/.
  4. Learn a classifier.
  5. Use that classifier on the unlabeled images.

In the code below I used jug (http://luispedro.org/software/jug) to give you the possibility of running it on multiple processors, but the code also works if you remove every line that mentions TaskGenerator.

from glob import glob
import mahotas
import mahotas.features
import milk
from jug import TaskGenerator


@TaskGenerator
def features_for(imname):
    # Mean Haralick texture features over all directions
    img = mahotas.imread(imname)
    return mahotas.features.haralick(img).mean(0)

@TaskGenerator
def learn_model(features, labels):
    # Train milk's default classifier on the labelled examples
    learner = milk.defaultclassifier()
    return learner.train(features, labels)

@TaskGenerator
def classify(model, features):
    return model.apply(features)

# Hand-picked examples, plus everything still to be classified
positives = glob('positives/*.jpg')
negatives = glob('negatives/*.jpg')
unlabeled = glob('unlabeled/*.jpg')

# A list (not a lazy map object) so this also works on Python 3
features = [features_for(im) for im in negatives + positives]
labels = [0] * len(negatives) + [1] * len(positives)

model = learn_model(features, labels)

labeled = [classify(model, features_for(u)) for u in unlabeled]

This uses texture features, which is probably good enough, but you can play with other features in mahotas.features if you'd like (or try mahotas.surf, but that gets more complicated). In general, I have found it hard to do classification with the sort of hard thresholds you are looking for unless the scanning is very controlled.

笑忘罢 2024-11-02 13:36:17

Just as a first try, sort your image folder by file size. If all scans from one document have the same resolution, the blank pages will almost always compress to smaller JPEG files than the non-blank ones.

I don't know how many pages you are scanning, but if the number is low enough this could be a simple quick fix.
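
If you would rather script that than sort in a file manager, here is a minimal sketch using only the standard library. The function name `scans_by_size` and the `*.jpg` pattern are assumptions for illustration.

```python
from pathlib import Path


def scans_by_size(folder, pattern="*.jpg"):
    """Return (size_in_bytes, path) pairs for matching files, smallest first."""
    files = [p for p in Path(folder).glob(pattern) if p.is_file()]
    return sorted((p.stat().st_size, p) for p in files)
```

Printing the first few entries of `scans_by_size("scans")` (hypothetical folder name) lists the likeliest blank pages, which you can then check by eye before deleting anything.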

月亮邮递员 2024-11-02 13:36:17

A few non-PIL-specific suggestions to consider:

Scans of printed or written material will have lots of high-contrast sharp edges; something like a median filter (to reduce noise) followed by some kind of simple edge detection might do a good job of discriminating real content from blank pages.
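
As one possible reading of that idea in Pillow, the sketch below applies `ImageFilter.MedianFilter`, then `ImageFilter.FIND_EDGES`, and averages the edge response. The name `edge_energy` and the `threshold` of 1.0 are assumptions that would need tuning on real scans.

```python
from PIL import Image, ImageFilter, ImageStat


def edge_energy(img, median_size=3):
    """Mean edge response of a page after median filtering (higher means more content)."""
    grey = img.convert("L")
    smoothed = grey.filter(ImageFilter.MedianFilter(size=median_size))  # suppress speckle
    edges = smoothed.filter(ImageFilter.FIND_EDGES)
    # Drop the outermost pixel ring, where the 3x3 kernel has no full neighbourhood.
    w, h = edges.size
    inner = edges.crop((1, 1, w - 1, h - 1))
    return ImageStat.Stat(inner).mean[0]


def has_content(img, threshold=1.0):
    """Guess that a page has print or writing when its edge energy exceeds `threshold`."""
    return edge_energy(img) > threshold
```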

Testing fragments at a time is useful not only because it might increase your accuracy, but because it might help you to give up early on many pages. Presumably most of your scans are not blank, so you should begin with a simple-minded check that usually identifies non-blank pages as non-blank; only if it says the page might be blank do you need to look more closely.
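
A rough sketch of that early exit: walk the page tile by tile and stop at the first tile whose grey levels vary too much to be blank paper. The tile size, the `max_stddev` of 8.0 and the function names are illustrative assumptions.

```python
from PIL import Image, ImageStat


def first_busy_tile(img, tile=256, max_stddev=8.0):
    """Return the box of the first tile that looks non-blank, or None if none does."""
    grey = img.convert("L")
    w, h = grey.size
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            # A large spread of grey levels suggests ink rather than bare paper.
            if ImageStat.Stat(grey.crop(box)).stddev[0] > max_stddev:
                return box  # give up early: this page is not blank
    return None


def might_be_blank(img, tile=256, max_stddev=8.0):
    """True only when every tile passed the cheap check; inspect such pages further."""
    return first_busy_tile(img, tile, max_stddev) is None
```

Pages for which `might_be_blank` returns `True` are the only ones that need a slower, more careful test.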

In case either the illumination or the page itself is nonuniform, you might want to begin by doing something like image = image - filter(image), where filter does a very broad smoothing of some kind. That reduces the need to identify the dominant shades, and it also copes with a dominant shade that isn't quite uniform across the page.
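
One way to sketch image - filter(image) in Pillow is to use a wide Gaussian blur as the smoothing filter and `ImageChops.subtract` for the difference. The blur radius of 25 and the offset of 128 (so that both darker and lighter deviations survive clipping) are assumptions, as is the name `flatten_background`.

```python
from PIL import Image, ImageChops, ImageFilter


def flatten_background(img, radius=25):
    """Subtract a heavily blurred copy so slow shading changes collapse to mid-grey."""
    grey = img.convert("L")
    background = grey.filter(ImageFilter.GaussianBlur(radius))  # broad smoothing
    # Result is (grey - background) + 128, clipped to 0..255.
    return ImageChops.subtract(grey, background, scale=1.0, offset=128)
```

After this step, blank paper should sit near grey level 128 regardless of uneven lighting, so a fixed threshold around that value becomes more plausible.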
