
Methods for removing blank (or nearly blank) pages from TIFF files

Posted 2024-11-02 03:53:57

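I have about 40 million TIFF documents, all 1-bit, single-page, duplex scans. In about 40% of cases, the back-side image of these TIFFs is "blank", and I would like to remove it before loading into a CMS to reduce space requirements.

Is there a simple way to look at the data content of each page and delete it if it falls below a preset threshold (say, 2% "black")?

I am technology-agnostic on this one, but a C# solution would probably be the easiest to support. The problem is that I have no image manipulation experience, so I don't really know where to start.

Edited to add: the images are old scans and therefore "dirty", so this is not expected to be an exact science. The threshold needs to be set so as to avoid the chance of false positives.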


薄情伤 2024-11-09 03:53:57

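You should probably:

open each image;
iterate through its pages (using the Bitmap.GetFrameCount / Bitmap.SelectActiveFrame methods);
access the bits of each page (using the Bitmap.LockBits method);
analyze the contents of each page (a simple loop);
if the contents are worth keeping, copy the data to another image (Bitmap.LockBits and a loop).

The task is not particularly complex, but it does require some code to be written. This site contains some samples, which you can find by searching with the method names as keywords.

P.S. I am assuming that all of the images can be loaded successfully into a System.Drawing.Bitmap.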
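The detection steps above could be sketched roughly as follows. This is a minimal sketch rather than a tested production routine: the class and method names (BlankPageScan, SetBitRatio, BlackRatios) are made up for illustration, it assumes every page really is 1 bit per pixel as described in the question, and it only measures pages. It does not write the cleaned file.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

static class BlankPageScan
{
    // Fraction of 1-bits in packed 1bpp rows. 'stride' is the byte length of
    // one row; only the first 'width' bits of each row are counted, so any
    // padding bits at the end of a row are ignored.
    public static double SetBitRatio(byte[] data, int stride, int width, int height)
    {
        long set = 0;
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                byte b = data[y * stride + (x >> 3)];
                // The leftmost pixel of each byte is its most significant bit.
                if ((b & (0x80 >> (x & 7))) != 0) set++;
            }
        }
        long total = (long)width * height;
        return total == 0 ? 0.0 : (double)set / total;
    }

    // Black-pixel ratio of every page in a (possibly multi-page) TIFF.
    public static double[] BlackRatios(string path)
    {
        using (var bmp = new Bitmap(path))
        {
            int pages = bmp.GetFrameCount(FrameDimension.Page);
            var ratios = new double[pages];
            for (int i = 0; i < pages; i++)
            {
                bmp.SelectActiveFrame(FrameDimension.Page, i);

                // Palette index 1 can mean black or white depending on the file,
                // so check it before interpreting the bits.
                Color[] palette = bmp.Palette.Entries;
                bool oneIsBlack = palette.Length > 1 && palette[1].GetBrightness() < 0.5f;

                var rect = new Rectangle(0, 0, bmp.Width, bmp.Height);
                BitmapData bits = bmp.LockBits(rect, ImageLockMode.ReadOnly,
                                               PixelFormat.Format1bppIndexed);
                try
                {
                    int rowBytes = (bits.Width + 7) / 8;
                    var packed = new byte[rowBytes * bits.Height];
                    // Copy row by row so that a negative stride is handled too.
                    for (int y = 0; y < bits.Height; y++)
                        Marshal.Copy(IntPtr.Add(bits.Scan0, y * bits.Stride),
                                     packed, y * rowBytes, rowBytes);

                    double ones = SetBitRatio(packed, rowBytes, bits.Width, bits.Height);
                    ratios[i] = oneIsBlack ? ones : 1.0 - ones;
                }
                finally
                {
                    bmp.UnlockBits(bits);
                }
            }
            return ratios;
        }
    }
}
```

A page whose ratio falls below the chosen threshold (for example 0.02, the 2% mentioned in the question) would then be a candidate for removal. The per-pixel loop is kept for readability; for tens of millions of files a per-byte lookup table would likely be faster.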


杀お生予夺 2024-11-09 03:53:57

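You can do something like that with DotImage (disclaimer, I work for Atalasoft and have written most of the underlying classes that you'd be using). The code to do it will look something like this: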

public void RemoveBlankPages(Stream stm)
{
    List<int> blanks = new List<int>();
    if (GetBlankPages(stm, blanks)) {
        // all pages blank - delete file?  Skip?  Your choice.
    }
    else {
        // memory stream is convenient - maybe a temp file instead?
        using (MemoryStream ostm = new MemoryStream()) {
            // pulls out all the blanks and writes to the temp stream
            stm.Seek(0, SeekOrigin.Begin);
            RemoveBlanks(blanks, stm, ostm);
            CopyStream(ostm, stm); // copies first stm to second, truncating at end
        }
    }
}

private bool GetBlankPages(Stream stm, List<int> blanks)
{
    TiffDecoder decoder = new TiffDecoder();
    ImageInfo info = decoder.GetImageInfo(stm);
    for (int i=0; i < info.FrameCount; i++) {
        try {
            stm.Seek(0, SeekOrigin.Begin);
            using (AtalaImage image = decoder.Read(stm, i, null)) {
                if (IsBlankPage(image)) blanks.Add(i);
            }
        }
        catch {
            // bad file - skip? could also try to remove the bad page:
            blanks.Add(i);
        }
    }
    return blanks.Count == info.FrameCount;
}

private bool IsBlankPage(AtalaImage image)
{
    // you might want to configure the command to do noise removal and black border
    // removal (or not) first.
    BlankPageDetectionCommand command = new BlankPageDetectionCommand();
    BlankPageDetectionResults results = command.Apply(image) as BlankPageDetectionResults;
    return results.IsImageBlank;
}

private void RemoveBlanks(List<int> blanks, Stream source, Stream dest)
{
    // blanks needs to be sorted low to high, which it will be if generated from
    // above
    TiffDocument doc = new TiffDocument(source);
    int totalRemoved = 0;
    foreach (int page in blanks) {
        doc.Pages.RemoveAt(page - totalRemoved);
        totalRemoved++;
    }
    doc.Save(dest);
}

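You should note that blank page detection is not as simple as "are all the pixels white(-ish)?" since scanning introduces all kinds of interesting artifacts. To get the BlankPageDetectionCommand, you would need the Document Imaging package.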
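To illustrate why a raw black-pixel count can mislead on dirty scans, here is a deliberately crude heuristic. It is not how BlankPageDetectionCommand works (the source does not describe its internals); the class DirtyScanHeuristic, the margin parameter, and the isolated-speck rule are all assumptions made for this sketch. It ignores a border band, where scanner edge shadows tend to appear, and skips black pixels that have no black neighbour.

```csharp
using System;

static class DirtyScanHeuristic
{
    // black[y, x] is true where the pixel is black.
    // Returns the share of "meaningful" black pixels inside the area that
    // remains after trimming 'margin' pixels from every edge.
    public static double InkRatio(bool[,] black, int margin)
    {
        int h = black.GetLength(0), w = black.GetLength(1);
        long counted = 0, area = 0;
        for (int y = margin; y < h - margin; y++)
        {
            for (int x = margin; x < w - margin; x++)
            {
                area++;
                if (!black[y, x]) continue;
                // A black pixel with no black 4-neighbour is treated as a speck
                // of scanner noise and is not counted.
                bool hasNeighbour =
                    (y > 0 && black[y - 1, x]) || (y < h - 1 && black[y + 1, x]) ||
                    (x > 0 && black[y, x - 1]) || (x < w - 1 && black[y, x + 1]);
                if (hasNeighbour) counted++;
            }
        }
        return area == 0 ? 0.0 : (double)counted / area;
    }

    // True when the filtered ink ratio is below the given threshold.
    public static bool LooksBlank(bool[,] black, int margin, double threshold)
    {
        return InkRatio(black, margin) < threshold;
    }
}
```

Even with such filtering, the margin and threshold would need tuning against a sample of real scans before being trusted for bulk deletion.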
