Comparing two PDF documents that are digitized faxes
I did a fair bit of looking around on the board before I posted here but I didn't see anything that captured what I was hoping to do.
We receive a large number of inbound faxes (500+ pages/day) as separate documents (around 100+ documents/day). Quite often the sender (being a hospital) resends the same document a couple hours after the first try. I'd like to flag the second send as a "potential clone" so that it can be routed and flagged appropriately.
I want to know how I can compute some sort of hash or ID for each arriving fax (PDF/TIFF), tag the document with it, and then quickly scan our document DB to see whether it's unique or not.
Obviously there is no way to be 100% sure without looking, but off the top of my head I'm thinking that one fax would be the same as another if:
- Same # of pages
- Sent within 24 hours of original
- Hash code is similar (within threshold)
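A minimal sketch of those three rules, assuming each document record carries a receipt timestamp and one 64-bit perceptual hash per page (the record shape and field names are illustrative, not an existing schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical per-document record; field names are illustrative only.
@dataclass
class FaxDoc:
    received_at: datetime
    page_hashes: list[int]   # one 64-bit perceptual hash per page

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_potential_clone(a: FaxDoc, b: FaxDoc, max_bits: int = 10) -> bool:
    """Apply the three rules: same page count, sent within 24 hours,
    and every per-page hash within a bit-distance threshold."""
    if len(a.page_hashes) != len(b.page_hashes):
        return False
    if abs(a.received_at - b.received_at) > timedelta(hours=24):
        return False
    return all(hamming(x, y) <= max_bits
               for x, y in zip(a.page_hashes, b.page_hashes))
```

The `max_bits` threshold is a tuning knob: lower values mean fewer false clones but more missed resends.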
But I am getting a bit bogged down on the image compare. I am looking for a threshold hash code or some way to say "the images on p4 of each fax are 95% likely to be the same". It's possible, for example, that p4 of the original fax was skewed but p4 of the resent fax is straight. I was thinking of running all the fax pages through something like Inlite Research's ClearImage Repair first to straighten, rotate, and calibrate all pages.
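One way to get such a "threshold hash" is a perceptual average hash: downsample each (deskewed) page to a tiny grayscale grid, then set one bit per cell depending on whether it is above the grid's mean. A pure-Python sketch, assuming the 8x8 grid of 0-255 values has already been produced by the repair/downsampling step:

```python
def average_hash(pixels):
    """Compute a 64-bit average hash from an 8x8 grid of grayscale
    values (0-255). In practice each page image would be deskewed
    and downsampled to 8x8 first."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def similarity(h1, h2):
    """Fraction of matching bits -- e.g. 0.95 for '95% likely the same'."""
    return 1 - bin(h1 ^ h2).count("1") / 64
```

Because the hash is built from a heavily downsampled image, small skew or noise differences between the two scans mostly wash out, which is exactly why the deskew/repair pre-pass helps.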
Has anyone done something like this?
Comments (4)
The difficulty is that if the second fax sent is the result of a new scan, the two files WILL have distinct hash values.
Measuring similarity (plausible duplication) between documents would likely require either OCR-ing them, or otherwise comparing their image content in a fuzzy fashion (i.e. after decompressing them).
Edit: Suggestions towards a HASH code for duplicate detection
Very tentatively, the following attributes of a document could be combined into a hash value capable of providing a good indication of plausible duplication:
These attributes should be obtained for each individual page. The reason is that pages are unequivocal boundaries, so by being "hard" on these limits we can allow softer (fuzzier) measurements within the page content.
Not all the following attributes would be necessary. These are generally listed from the easier to get to the ones that require more programming.
(for each page!)
With regards to the "hash", it should be as wide as possible, ideally a variable-length hash made by appending, say, 32-bit or 64-bit hashes, one per page.
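That per-page scheme could be sketched as follows; `document_fingerprint` and `pages_within_threshold` are hypothetical names for illustration:

```python
import struct

def document_fingerprint(page_hashes):
    """Variable-length fingerprint as suggested above: append one
    64-bit hash per page. Two fingerprints of different lengths can
    be rejected immediately (different page counts)."""
    return b"".join(struct.pack(">Q", h) for h in page_hashes)

def pages_within_threshold(fp_a, fp_b, max_bits=10):
    """Compare two fingerprints page by page, using a Hamming-distance
    threshold per page instead of exact byte equality."""
    if len(fp_a) != len(fp_b):
        return False
    for i in range(0, len(fp_a), 8):
        (ha,) = struct.unpack(">Q", fp_a[i:i + 8])
        (hb,) = struct.unpack(">Q", fp_b[i:i + 8])
        if bin(ha ^ hb).count("1") > max_bits:
            return False
    return True
```

Keeping page boundaries "hard" this way means a mismatched page count short-circuits the comparison before any fuzzy per-page work happens.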
If OCR is not an option, you could take an image-based approach. One possibility would be to downsample/filter the fax images (to remove high-frequency noise), then compute the normalized correlation between the pixels of the two downsampled images. Obviously, there are MUCH more robust approaches, but this might be sufficient to flag two faxes for manual inspection. Especially if the image repair software you mentioned can automatically orient and scale each page.
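The normalized correlation described here can be computed directly. A sketch over two flat lists of downsampled grayscale pixel values, assuming both pages have already been scaled to the same size:

```python
import math

def normalized_correlation(a, b):
    """Pearson-style normalized correlation between two equally sized
    downsampled grayscale images (flat lists of pixel values).
    Returns ~1.0 for near-identical pages, ~-1.0 for inverted ones."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 1.0
```

Because the means are subtracted out, a uniform brightness difference between the two scans (a common fax artifact) does not lower the score.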
If the documents are mostly text, OCR-ing them is a good idea. Comparing the text is straightforward.
A "distance" calculation could be done, I suppose, but what if the fax is sent upside-down the second time? Or they enlarged it to make it more legible?
I'd try to tackle the subset of documents you're likely to encounter rather than applying a general algorithm. You'll get better results because it won't be looking for everything under the sun.
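For the OCR route, the text comparison really is straightforward. A sketch using Python's standard-library `difflib` (the OCR step itself is assumed to have happened elsewhere):

```python
import difflib

def text_similarity(text_a: str, text_b: str) -> float:
    """Ratio in [0, 1] between the OCR output of two pages;
    tolerant of scattered OCR errors between the two scans."""
    # Normalize whitespace so line-wrap differences don't count.
    a = " ".join(text_a.split())
    b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, a, b).ratio()
```

A threshold around 0.9 would tolerate the character-level misreads that two independent scans of the same page typically produce.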
I think the OpenCV library is what you're looking for. If I recall correctly it has image similarity tools, either via landmark recognition or frequency-domain techniques. It's possible to do approximate hashing in the frequency domain without running into much trouble from small differences in the images.
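A frequency-domain hash of the kind hinted at here is the classic DCT-based perceptual hash: keep only the low-frequency corner of the DCT, where small scan-to-scan differences have little effect. A naive pure-Python sketch for illustration (a real pipeline would use OpenCV's `img_hash` module from opencv-contrib, or an FFT library, rather than this O(n^4) DCT):

```python
import math

def dct2(block):
    """Naive 2-D DCT-II of a square matrix (fine for tiny inputs)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = s
    return out

def phash(pixels):
    """64-bit perceptual hash of a 32x32 grayscale grid: keep the 8x8
    low-frequency corner of the DCT and threshold against its median.
    Small spatial differences between two scans mostly perturb the
    high frequencies, which are discarded."""
    coeffs = dct2(pixels)
    low = [coeffs[u][v] for u in range(8) for v in range(8)]
    med = sorted(low)[len(low) // 2]
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > med else 0)
    return bits
```

Two `phash` values would then be compared by Hamming distance, exactly as with the average hash, but with better robustness to mild blur and noise.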