How to remove blank (or nearly blank) pages from TIFF files

Posted on 2024-11-02 03:53:57

I have something like 40 million TIFF documents, all 1-bit single page duplex. In about 40% of cases, the back image of these TIFFs is 'blank' and I'd like to remove them before I do a load to a CMS to reduce space requirements.

Is there a simple method to look at the data content of each page and delete it if it falls under a preset threshold, say 2% 'black'?

I'm technology agnostic on this one, but a C# solution would probably be the easiest to support. Problem is, I've no image manipulation experience so don't really know where to start.

Edit to add: The images are old scans and so are 'dirty', so this is not expected to be an exact science. The threshold would need to be set to avoid the chance of false positives.

Comments (3)

薄情伤 2024-11-09 03:53:57

You probably should:

  • open each image
  • iterate through its pages (using Bitmap.GetFrameCount / Bitmap.SelectActiveFrame methods)
  • access bits of each page (using Bitmap.LockBits method)
  • analyze contents of each page (simple loop)
  • if the contents are worth keeping, copy the data to another image (Bitmap.LockBits and a loop)

This task isn't particularly complex, but it does require some code to be written. This site contains samples that you can find by searching for the method names above.

P.S. I assume that all of the images can be loaded successfully into a System.Drawing.Bitmap.
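
The first four steps above might be sketched roughly as follows. This is a minimal sketch, not a tested production routine: the class name BlankPageScan, the helper CountSetBits, and the threshold parameter are all made up for illustration, and it assumes every page really is 1-bit (it throws otherwise). It reads the palette to decide whether a set bit means black or white, since that varies between TIFF files. Copying the kept pages to a new file (the last step) is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

static class BlankPageScan
{
    // Counts set bits in one packed 1-bpp row (most significant bit = leftmost pixel).
    // Padding bits beyond 'width' are ignored.
    public static long CountSetBits(byte[] row, int width)
    {
        long set = 0;
        for (int x = 0; x < width; x++)
            if ((row[x >> 3] & (0x80 >> (x & 7))) != 0) set++;
        return set;
    }

    // Returns the zero-based indexes of pages whose black fraction is below 'threshold'.
    public static List<int> FindBlankPages(string path, double threshold)
    {
        var blanks = new List<int>();
        using (var bmp = new Bitmap(path))
        {
            int pages = bmp.GetFrameCount(FrameDimension.Page);
            for (int i = 0; i < pages; i++)
            {
                bmp.SelectActiveFrame(FrameDimension.Page, i);
                if (bmp.PixelFormat != PixelFormat.Format1bppIndexed)
                    throw new NotSupportedException("This sketch handles 1-bit pages only.");

                // The palette decides whether a set bit means black or white.
                Color[] palette = bmp.Palette.Entries;
                bool setIsBlack = palette.Length > 1 && palette[1].GetBrightness() < 0.5f;

                var rect = new Rectangle(0, 0, bmp.Width, bmp.Height);
                BitmapData bits = bmp.LockBits(rect, ImageLockMode.ReadOnly,
                                               PixelFormat.Format1bppIndexed);
                long set = 0;
                try
                {
                    var row = new byte[(bits.Width + 7) / 8];
                    for (int y = 0; y < bits.Height; y++)
                    {
                        // Copy row by row so a negative stride (bottom-up bitmap) also works.
                        IntPtr src = IntPtr.Add(bits.Scan0, y * bits.Stride);
                        Marshal.Copy(src, row, 0, row.Length);
                        set += CountSetBits(row, bits.Width);
                    }
                }
                finally
                {
                    bmp.UnlockBits(bits);
                }

                long total = (long)bits.Width * bits.Height;
                long black = setIsBlack ? set : total - set;
                if ((double)black / total < threshold) blanks.Add(i);
            }
        }
        return blanks;
    }
}
```

A call such as BlankPageScan.FindBlankPages("scan.tif", 0.02) (the file name is hypothetical) would apply the 2% figure from the question. With dirty scans, a plain pixel count can be fooled by black borders or speckle, so the threshold probably needs tuning against a sample of real files.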

杀お生予夺 2024-11-09 03:53:57

You can do something like that with DotImage (disclaimer, I work for Atalasoft and have written most of the underlying classes that you'd be using). The code to do it will look something like this:

public void RemoveBlankPages(Stream stm)
{
    List<int> blanks = new List<int>();
    if (GetBlankPages(stm, blanks)) {
        // all pages blank - delete file?  Skip?  Your choice.
    }
    else {
        // memory stream is convenient - maybe a temp file instead?
        using (MemoryStream ostm = new MemoryStream()) {
            // pulls out all the blanks and writes to the temp stream
            stm.Seek(0, SeekOrigin.Begin);
            RemoveBlanks(blanks, stm, ostm);
            CopyStream(ostm, stm); // copies first stm to second, truncating at end
        }
    }
}

private bool GetBlankPages(Stream stm, List<int> blanks)
{
    TiffDecoder decoder = new TiffDecoder();
    ImageInfo info = decoder.GetImageInfo(stm);
    for (int i=0; i < info.FrameCount; i++) {
        try {
            stm.Seek(0, SeekOrigin.Begin);
            using (AtalaImage image = decoder.Read(stm, i, null)) {
                if (IsBlankPage(image)) blanks.Add(i);
            }
        }
        catch {
            // bad file - skip? could also try to remove the bad page:
            blanks.Add(i);
        }
    }
    return blanks.Count == info.FrameCount;
}

private bool IsBlankPage(AtalaImage image)
{
    // you might want to configure the command to do noise removal and black border
    // removal (or not) first.
    BlankPageDetectionCommand command = new BlankPageDetectionCommand();
    BlankPageDetectionResults results = command.Apply(image) as BlankPageDetectionResults;
    return results.IsImageBlank;
}

private void RemoveBlanks(List<int> blanks, Stream source, Stream dest)
{
    // blanks needs to be sorted low to high, which it will be if generated from
    // above
    TiffDocument doc = new TiffDocument(source);
    int totalRemoved = 0;
    foreach (int page in blanks) {
        doc.Pages.RemoveAt(page - totalRemoved);
        totalRemoved++;
    }
    doc.Save(dest);
}

You should note that blank page detection is not as simple as "are all the pixels white(-ish)?" since scanning introduces all kinds of interesting artifacts. To get the BlankPageDetectionCommand, you would need the Document Imaging package.
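
The code above calls CopyStream without defining it. Going only by its comment ("copies first stm to second, truncating at end"), one possible version is sketched below. This is an assumption about what the helper does, not a DotImage API; the class name StreamUtil is made up, and both streams are assumed to be seekable.

```csharp
using System;
using System.IO;

static class StreamUtil
{
    // Overwrites 'dest' with the full contents of 'source', then trims any
    // leftover bytes from the old, longer contents of 'dest'.
    public static void CopyStream(Stream source, Stream dest)
    {
        if (!source.CanSeek || !dest.CanSeek || !dest.CanWrite)
            throw new ArgumentException("Both streams must be seekable and dest must be writable.");

        // After a save, 'source' is positioned at its end, so rewind both streams.
        source.Seek(0, SeekOrigin.Begin);
        dest.Seek(0, SeekOrigin.Begin);

        source.CopyTo(dest);

        // Truncate: the new file is shorter than the original once pages are removed.
        dest.SetLength(dest.Position);
        dest.Flush();
    }
}
```

Rewinding inside the helper matters here because the temporary stream is left at its end after doc.Save(dest). At 40 million files, writing to a temp file and swapping it in may be safer than overwriting the original in place, since a failure mid-copy would otherwise leave a damaged file.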

许仙没带伞 2024-11-09 03:53:57

Are you interested in shrinking the files, or do you just want to stop people wasting their time viewing blank pages? If it's the latter, a quick-and-dirty edit gets rid of known blank pages: patch the first IFD's "next IFD offset" field to 0x00000000. Here's what I mean. TIFF files have a simple layout if you're just navigating through the pages:

TIFF header (4 bytes: 2-byte byte-order mark, then the 2-byte magic number 42)
First IFD offset (4 bytes, typically 0x00000008)

IFD:

Number of tags (2 bytes)

{individual TIFF tags} (12 bytes each)

Next IFD offset (4 bytes, 0x00000000 on the last page)

Just patch the "next IFD offset" to a value of 0x00000000 to "unlink" pages beyond the current one.
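
The patch described above could look something like this. It is a minimal sketch for classic (non-BigTIFF) files only; the names TiffUnlink and UnlinkAfterFirstPage are made up for illustration, and the stream is assumed to be seekable and writable. It honours the byte-order mark ("II" little-endian, "MM" big-endian) when reading the offsets.

```csharp
using System;
using System.IO;

static class TiffUnlink
{
    // Reads an unsigned integer of 'count' bytes (2 or 4) in the file's byte order.
    static uint ReadUInt(Stream s, int count, bool littleEndian)
    {
        var b = new byte[count];
        int got = 0;
        while (got < count)
        {
            int n = s.Read(b, got, count - got);
            if (n <= 0) throw new EndOfStreamException();
            got += n;
        }
        uint v = 0;
        for (int i = 0; i < count; i++)
        {
            int idx = littleEndian ? count - 1 - i : i;
            v = (v << 8) | b[idx];
        }
        return v;
    }

    // Zeroes the first IFD's next-IFD offset so readers stop after page one.
    // Returns the offset that was there before (0 if the file already had one page).
    public static uint UnlinkAfterFirstPage(Stream tiff)
    {
        tiff.Seek(0, SeekOrigin.Begin);
        int b0 = tiff.ReadByte(), b1 = tiff.ReadByte();
        bool little;
        if (b0 == 'I' && b1 == 'I') little = true;
        else if (b0 == 'M' && b1 == 'M') little = false;
        else throw new InvalidDataException("Missing TIFF byte-order mark.");

        if (ReadUInt(tiff, 2, little) != 42)
            throw new InvalidDataException("Not a classic TIFF (magic number is not 42).");

        uint firstIfd = ReadUInt(tiff, 4, little);
        tiff.Seek(firstIfd, SeekOrigin.Begin);
        uint tagCount = ReadUInt(tiff, 2, little);

        // The next-IFD offset sits right after the 12-byte tag entries.
        long nextPos = firstIfd + 2L + 12L * tagCount;
        tiff.Seek(nextPos, SeekOrigin.Begin);
        uint old = ReadUInt(tiff, 4, little);

        if (old != 0)
        {
            tiff.Seek(nextPos, SeekOrigin.Begin);
            tiff.Write(new byte[4], 0, 4); // zero is the same in either byte order
        }
        return old;
    }
}
```

Note that this only hides the later pages: their image data stays in the file as orphaned bytes, so the file does not get any smaller. It also unlinks every page after the first, so it only fits the case where the back page is already known to be blank.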
