How should I handle checksum collisions in my application?
Part of my application stores files. Because we could potentially be adding many copies of the same file, I first compute and keep a hash of each file. If two files have the same hash, then we throw out one, and both "references" to that file point to the same physical file.
How much should I be worried about hash collisions?
In the case of a collision, what should I do? The whole crux of my code so far depends on there not being two different files with the same hash. In the event of a collision right now, my app would throw out a legitimately different file and point to the existing file with the same hash.
Should I be using something other than MD5? Does SHA-1 have a better collision rate?
Unless your application is truly critical, do not worry about accidental hash collisions. They are so rare that many systems simply assume they never happen, even though those systems would fail catastrophically if that assumption proved false just once.
SHA-1 has a larger output space than MD5 (160 bits versus 128), so it's definitely not a worse choice against accidental collisions. If you are afraid of someone actively colliding your hashes, note that practical collision attacks have been demonstrated against both MD5 and SHA-1, so a later variant of SHA, such as SHA-256, would be a better idea.
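As a rough illustration of the scheme in the question, switched to SHA-256, here is a minimal Node.js sketch. The `DedupStore` class and `sha256Hex` helper are hypothetical names invented for this example, and the store lives in memory rather than on disk; only `crypto.createHash` comes from Node's standard library.

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-256 digest of a Buffer or string.
function sha256Hex(data) {
  return crypto.createHash('sha256').update(data).digest('hex');
}

// Hypothetical in-memory store: one physical copy per distinct digest.
class DedupStore {
  constructor() {
    this.blobs = new Map(); // digest -> Buffer
  }

  // Stores the content only if its digest is new, and returns the digest,
  // which callers keep as their "reference" to the file.
  add(content) {
    const buf = Buffer.from(content);
    const key = sha256Hex(buf);
    if (!this.blobs.has(key)) {
      this.blobs.set(key, buf);
    }
    return key;
  }
}

const store = new DedupStore();
const ref1 = store.add('same contents');
const ref2 = store.add('same contents');
console.log(ref1 === ref2, store.blobs.size);
```

This sketch trusts the digest alone, which is exactly the assumption the question asks about.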
The chance of a collision between the hashes of any two randomly selected bitstreams is inversely proportional to the number of distinct states that the hash can represent. So a 64-bit hash encodes 2 ** 64 states and has a 1 / (2**64) chance of a collision for any given pair of files. But what you really care about is the chance of a collision anywhere in a (large) set of files, so you need to do the "birthday paradox" calculation, plugging in the probability of a pairwise collision and the expected number of files. Still, I think the bottom line is that throwing away a file without doing a comparison is unsafe, even if the numbers say that the probability of a collision is small.
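To make the birthday calculation concrete, here is a small sketch using the standard approximation p ≈ 1 − exp(−n(n−1)/(2N)) for n files and N possible hash values. It assumes the hash behaves like a uniform random function, which is an idealization. The function names are made up for this example; `sameBytes` shows the byte-for-byte check that the answer recommends before discarding a file.

```javascript
// Birthday-bound approximation: probability that at least two of n items
// share a value when each value is drawn uniformly from `states` possibilities.
function collisionProbability(n, states) {
  const pairs = (n * (n - 1)) / 2;     // number of distinct pairs
  // 1 - exp(-pairs / states); expm1 keeps precision when the result is tiny.
  return -Math.expm1(-pairs / states);
}

// Byte-for-byte comparison to run before discarding a file whose hash matched.
function sameBytes(a, b) {
  return Buffer.from(a).equals(Buffer.from(b));
}

console.log(collisionProbability(23, 365));      // classic birthday case, close to 0.5
console.log(collisionProbability(1e9, 2 ** 64)); // a billion files, 64-bit hash: about 0.027
console.log(collisionProbability(1e9, 2 ** 128));
```

Under these assumptions a 64-bit hash is already uncomfortable at a billion files, while a 128-bit or larger digest pushes the estimate to a negligible value.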
In the provided scenario you never have to worry. It is not possible for two different documents to have the same checksum unless they are the same. Imagine this:
var a = 1;
var b = 2;
b + 3 === 5; // true, yay!
a + 3 !== 5; // true: no collision is possible as long as a does not equal 2
With any value of a other than 2, a + 3 can never evaluate to 5, so no collision is possible. Since you are using (or should be using) a one-way hashing algorithm, the resulting hash will always depend on its input.
Hash collisions happen when you're dealing with randomly generated hashes, which, because of their random, unspecified inputs, could collide, though that is very unlikely.
Please note that I am in no way implying that one-way hashing algorithms work by simple addition. I'm merely using addition as a simple example, based on the notion that both take a set of values and output a different set of values derived from them.
How much should I be worried about hash collisions?
R- It depends on how critical this is to your application.
In the case of a collision what should I do? The whole crux of my code so far depends on there not being two different files with the same hash. In the event of a collision right now, my app would throw out a legitimately different file and point to the file with the same hash.
R- To guard against a collision in the hash comparison, once you detect matching hashes you can add some bytes to both files' contents, generate the hashes again, and compare once more. No matter how many times you do this, if any one of the hash comparisons differs, the files are different.
Should I be using something other than MD5? Does SHA-1 have a better collision rate?
R- The larger the digest the algorithm produces, the lower the chance of a collision. The same reasoning holds when choosing among the algorithms of the SHA-2 family.
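The rehash idea above could look roughly like the following sketch. It makes two assumptions that are not in the answer: the extra bytes are random and are prepended rather than appended (appending the same suffix to two colliding inputs can preserve the collision in MD5 and SHA-1 style constructions), and the default algorithm is SHA-256. `saltedHash` and `differsAfterRehash` are hypothetical names.

```javascript
const crypto = require('crypto');

// Hash of the salt followed by the data, hex-encoded.
function saltedHash(salt, data, algorithm) {
  return crypto.createHash(algorithm).update(salt).update(data).digest('hex');
}

// Returns true as soon as one salted comparison differs, in which case the
// inputs are certainly different. Returns false if every round matched,
// which makes a difference very unlikely but does not prove equality.
function differsAfterRehash(a, b, rounds = 3, algorithm = 'sha256') {
  for (let i = 0; i < rounds; i++) {
    const salt = crypto.randomBytes(16); // fresh random bytes each round
    if (saltedHash(salt, a, algorithm) !== saltedHash(salt, b, algorithm)) {
      return true;
    }
  }
  return false;
}

console.log(differsAfterRehash('file one', 'file two'));
console.log(differsAfterRehash('same file', 'same file'));
```

Since every round reads both files in full anyway, a direct byte-for-byte comparison costs about the same and gives a definite answer.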
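For reference, the digest sizes under discussion can be checked directly with Node's `crypto` module. This is only a quick sketch; `digestBits` is a helper name made up here, and it assumes the listed algorithms are enabled in the local OpenSSL build.

```javascript
const crypto = require('crypto');

// Output size, in bits, of the named hash algorithm.
function digestBits(algorithm) {
  return crypto.createHash(algorithm).digest().length * 8;
}

for (const name of ['md5', 'sha1', 'sha256', 'sha512']) {
  console.log(name, digestBits(name)); // 128, 160, 256, 512
}
```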