Method for deleting blank (or nearly blank) pages from TIFF files
I have something like 40 million TIFF documents, all 1-bit single page duplex. In about 40% of cases, the back image of these TIFFs is 'blank' and I'd like to remove them before I do a load to a CMS to reduce space requirements.
Is there a simple method to look at the data content of each page and delete it if it falls under a preset threshold, say 2% 'black'?
I'm technology agnostic on this one, but a C# solution would probably be the easiest to support. Problem is, I've no image manipulation experience so don't really know where to start.
Edit to add: The images are old scans and so are 'dirty', so this is not expected to be an exact science. The threshold would need to be set to avoid the chance of false positives.
Comments (3)
You probably should:
iterate through the pages of each file (using the Bitmap.GetFrameCount / Bitmap.SelectActiveFrame methods)
access the bits of each page (using the Bitmap.LockBits method)
examine the pixel data of each page (using Bitmap.LockBits and a loop)
This task isn't particularly complex but will require some code to be written. This site contains some samples that you can search for using the method names as keywords.
P.S. I assume that all of the images can be successfully loaded into a System.Drawing.Bitmap.
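The "bits and a loop" step can be sketched independently of System.Drawing. The following is a minimal Python illustration, not the answerer's code: it assumes the page has already been copied into a byte buffer of packed 1-bit rows (most significant bit first, positive stride), roughly what a LockBits-style call would expose. The names `black_fraction` and `is_blank` and the `black_is_one` flag are hypothetical; which bit value means "black" depends on the image's palette, so it is left as a parameter.

```python
def black_fraction(data: bytes, width: int, height: int, stride: int,
                   black_is_one: bool = True) -> float:
    """Fraction of black pixels in a packed 1-bit image (MSB-first rows)."""
    full_bytes, tail_bits = divmod(width, 8)
    # Mask that keeps only the valid leading bits of the last partial byte,
    # so row padding never counts as image content.
    tail_mask = (0xFF << (8 - tail_bits)) & 0xFF if tail_bits else 0
    ones = 0
    for y in range(height):
        row = data[y * stride:(y + 1) * stride]
        ones += sum(bin(b).count("1") for b in row[:full_bytes])
        if tail_bits:
            ones += bin(row[full_bytes] & tail_mask).count("1")
    total = width * height
    black = ones if black_is_one else total - ones
    return black / total


def is_blank(data: bytes, width: int, height: int, stride: int,
             threshold: float = 0.02, black_is_one: bool = True) -> bool:
    """True when the page has less than `threshold` black pixels (2% by default)."""
    return black_fraction(data, width, height, stride, black_is_one) < threshold
```

The 2% default mirrors the threshold suggested in the question. With dirty scans it would need tuning against a sample of known blank and non-blank pages before deleting anything.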
You can do something like that with DotImage (disclaimer, I work for Atalasoft and have written most of the underlying classes that you'd be using). The code to do it will look something like this:
You should note that blank page detection is not as simple as "are all the pixels white(-ish)?" since scanning introduces all kinds of interesting artifacts. To get the BlankPageDetectionCommand, you would need the Document Imaging package.
Are you interested in shrinking the files, or do you just want to stop people wasting their time viewing blank pages? You can do a quick and dirty edit of the files to get rid of known blank pages by patching the offset that points to the second IFD to 0x00000000. Here's what I mean. TIFF files have a simple layout if you're just navigating through the pages:
TIFF header (4 bytes)
First IFD offset (4 bytes, typically 0x00000008)
IFD:
    Number of tags (2 bytes)
    {individual TIFF tags} (12 bytes each)
    Next IFD offset (4 bytes)
Just patch the "next IFD offset" to 0x00000000 to "unlink" the pages beyond the current one. The unlinked page data stays in the file, so this hides the page without shrinking the file.
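The layout above can be walked with a few lines of code. This is a minimal Python sketch under stated assumptions, not a production tool: it handles classic TIFF only (magic number 42, 32-bit offsets), works on the whole file held in memory, and the function names `ifd_offsets` and `unlink_after_first_page` are hypothetical.

```python
import struct

_ORDERS = {b"II": "<", b"MM": ">"}  # little-endian / big-endian byte-order marks


def ifd_offsets(tiff: bytes) -> list:
    """Return the offset of every IFD (one per page) by following the chain."""
    order = _ORDERS[tiff[:2]]
    magic, offset = struct.unpack_from(order + "HI", tiff, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    offsets = []
    while offset != 0:
        if offset in offsets:
            raise ValueError("IFD chain loops back on itself")
        offsets.append(offset)
        (count,) = struct.unpack_from(order + "H", tiff, offset)
        # The next-IFD offset sits right after the 2-byte count and the tags.
        (offset,) = struct.unpack_from(order + "I", tiff, offset + 2 + 12 * count)
    return offsets


def unlink_after_first_page(tiff: bytes) -> bytes:
    """Zero the first IFD's next-IFD offset so readers see only page one."""
    order = _ORDERS[tiff[:2]]
    first = ifd_offsets(tiff)[0]
    (count,) = struct.unpack_from(order + "H", tiff, first)
    patched = bytearray(tiff)
    struct.pack_into(order + "I", patched, first + 2 + 12 * count, 0)
    return bytes(patched)
```

For a two-page duplex scan, `len(ifd_offsets(data))` would be 2 before patching and 1 after, while the byte length stays the same. Only pages already known to be blank should be unlinked this way.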