Checking if two image files are the same: checksum or hash?
I am working on some image processing code in which I download images (as BufferedImage) from URLs and pass them on to an image processor.
I want to avoid passing the same image to the image processor more than once (the image processing operation is expensive). The URL endpoints may vary even when they point to the same image, so I cannot prevent this by comparing URLs. I was therefore planning to compute a checksum or hash to detect when the code encounters the same image again.
For MD5 I tried Fast MD5, and it generated a 20K+ character hex checksum value for the image (in one sample). Obviously, storing a 20K+ character hash would be a problem for database storage. I then tried CRC32 (from java.util.zip.CRC32), which produced a much shorter checksum than the hash.
I do understand that checksums and hashes serve different purposes. For the purpose described above, can I just use CRC32? Will it do the job, or do I have to try something beyond these two?
Thanks,
Abi
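(For reference, MD5 always produces a 16-byte digest, i.e. exactly 32 hex characters, so a 20K+ character result suggests something other than the digest itself was being stored. A minimal sketch of the dedup idea using the JDK's own `MessageDigest`; the class name and byte arrays here are made up for illustration, with in-memory bytes standing in for downloaded image data:)

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ImageDedup {
    // Hex digests of images already handed to the (expensive) processor.
    private static final Set<String> seen = new HashSet<>();

    // MD5 of the raw image bytes, hex-encoded: always exactly 32 characters.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // every JDK is required to provide MD5
        }
    }

    // True the first time a given image content is seen, false afterwards,
    // regardless of which URL it was downloaded from.
    static boolean isNew(byte[] imageBytes) {
        return seen.add(md5Hex(imageBytes));
    }

    public static void main(String[] args) {
        byte[] fromUrlA = {10, 20, 30};
        byte[] fromUrlB = {10, 20, 30}; // same content, different URL
        System.out.println(md5Hex(fromUrlA).length()); // 32
        System.out.println(isNew(fromUrlA));           // true
        System.out.println(isNew(fromUrlB));           // false: skip the processor
    }
}
```

The 32-character digest (or the raw 16 bytes) is small enough to store as a database column and use as a unique key.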
2 Answers
The difference between CRC and, say, MD5, is that it is harder to tamper with a file so that it matches a "target" MD5 than to tamper with it to match a "target" checksum. Since that does not seem to be a concern for your program, it should not matter which method you use. MD5 may be a little more CPU-intensive, but I do not know whether that difference will matter here.
The main question should be the number of bytes of the digest.
If you do the checksum in a 32-bit integer, then for a 2 KB file (16384 bits, hence 2^16384 possible files of that size) you are mapping 2^16384 combinations onto only 2^32 values, so on average 2^16352 distinct files share each CRC value; a 128-bit MD5 still leaves 2^16256 files per value. What matters in practice is the chance that two *different* files collide: roughly 2^-32 for CRC32 versus roughly 2^-128 for MD5.
The bigger the code you compute, the fewer possible collisions (given that the computed codes are distributed evenly), so the safer the comparison.
Anyway, in order to minimize possible errors, I think the first classification should use file size... first compare file sizes, and only if they match compare checksums/hashes.
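The size-first classification could be sketched like this (a hedged illustration with made-up class and method names, using in-memory byte arrays in place of files; only same-size candidates ever get checksummed):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class SizeFirstFilter {
    // First pass: bucket candidates by length. Only same-size files can be
    // equal, so checksums need computing only inside buckets with 2+ entries.
    static Map<Integer, List<byte[]>> bucketBySize(List<byte[]> files) {
        Map<Integer, List<byte[]>> buckets = new HashMap<>();
        for (byte[] f : files) {
            buckets.computeIfAbsent(f.length, k -> new ArrayList<>()).add(f);
        }
        return buckets;
    }

    // Second pass: CRC32 over the raw bytes (java.util.zip.CRC32).
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};    // duplicate content
        byte[] c = {9, 9, 9, 9}; // different size: never hashed at all
        Map<Integer, List<byte[]>> buckets = bucketBySize(Arrays.asList(a, b, c));
        System.out.println(buckets.get(3).size()); // 2: only these need a checksum
        System.out.println(crc32(a) == crc32(b));  // true
    }
}
```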
A checksum and a hash are basically the same for this purpose. You should be able to use any kind of hash; a regular MD5 would normally suffice. If you like, you could store the size together with the MD5 hash (which is 16 bytes).
If two files have different sizes, they are different files, and you will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are on the larger side (like JPG pictures taken with a camera), this optimization may save you a lot of time.
If two or more files have the same size, you can calculate the hashes and compare them.
If two hashes are the same, you can compare the actual data to see whether it differs after all. A false match is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, while CRC32 is only 4), the less likely that two different files will have the same hash.
This extra check will take only ten minutes of programming, though, so I'd say: better safe than sorry. :)
To optimize this further: if exactly two files have the same size, you can just compare their data directly. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size?