How are hash functions like MD5 unique?
I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely the result cannot be unique, as there are more distinct arbitrary strings than there are distinct 32-digit hex values.
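To make the pigeonhole argument concrete, here is a small sketch (not part of the original question) that keeps only the first 2 bytes of the MD5 digest, so there are just 65,536 possible values. Hashing 65,537 distinct inputs is then guaranteed to hit a collision. The helper names `truncated_md5` and `find_collision` are made up for illustration.

```python
import hashlib


def truncated_md5(data: bytes, nbytes: int = 2) -> bytes:
    """Return only the first `nbytes` bytes of the MD5 digest (a toy hash)."""
    return hashlib.md5(data).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Hash distinct inputs until two of them share a truncated digest."""
    seen = {}  # truncated digest -> first input that produced it
    # There are 256**nbytes possible digests, so one more input than that
    # must produce a repeat (pigeonhole principle).
    for i in range(256 ** nbytes + 1):
        data = str(i).encode()
        digest = truncated_md5(data, nbytes)
        if digest in seen:
            return seen[digest], data, digest
        seen[digest] = data
    raise AssertionError("unreachable: a collision is guaranteed")


if __name__ == "__main__":
    a, b, digest = find_collision()
    print(a, b, digest.hex())
```

The same reasoning applies to the full 128-bit digest; the only difference is that the number of "holes" is far too large to exhaust by counting.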
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible, and while there may be many distinct inputs that produce the same hash result (a "collision"), a well-designed hash function makes finding any two of them computationally infeasible.
You're correct that it cannot guarantee uniqueness. However, a 32-digit hex value can take 16^32 = 2^128, or approximately 3.4 × 10^38, different values. That means that, assuming the math behind the algorithm gives a good distribution, the odds of an accidental duplicate are phenomenally small. You do have to keep in mind that a duplicate IS possible when you're thinking about how the hash will be used. MD5 is generally used to determine whether something has changed (i.e. as a checksum). It would be ridiculously unlikely for something to be modified by accident and still produce the same MD5 checksum.
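As a rough illustration of "phenomenally small", the sketch below uses the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)) for n uniformly random values. It assumes an ideal hash with perfectly uniform output, which is the same assumption made above, and the function name `collision_probability` is only illustrative.

```python
import math


def collision_probability(n: int, bits: int = 128) -> float:
    """Approximate chance that n random `bits`-bit values contain a duplicate.

    Birthday approximation: p ~= 1 - exp(-n*(n-1) / 2**(bits+1)).
    Assumes an ideal, uniformly distributed hash (an assumption, not a fact
    about MD5).
    """
    exponent = (n * (n - 1)) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 2**64):
        print(n, collision_probability(n))
```

Under this idealized model, an accidental collision only becomes likely once roughly 2^64 items have been hashed. It says nothing about collisions that an attacker constructs on purpose.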
Edit: (given recent news re: SHA1 hashes)
The answer above still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 hashes are 2^32 (over 4 billion) times less likely to collide by chance, yet it has been demonstrated that an input can be contrived to produce the same value. (This was demonstrated against MD5 quite some time ago.) If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days you need at least SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security check context, MD5 still has its uses.
The argument could be made that a SHA-2 hash is cheap enough to compute that you should just use it anyway.
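For the "has this changed?" use described in this answer, a minimal sketch with SHA-256 (one of the SHA-2 functions) from Python's standard library might look like this. The function names and the chunk size are illustrative choices, not something the answer specifies.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_unchanged(path: str, expected_hex: str) -> bool:
    """True if the file's current digest matches a previously recorded one."""
    return file_sha256(path) == expected_hex.lower()
```

Swapping `hashlib.sha256` for `hashlib.md5` gives the MD5 checksum variant, which is still adequate for detecting accidental corruption but not deliberate tampering.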
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you most likely won't have O but rather its hash hO, which came with P). First, you hash P to get hP.
There are now 2 possibilities:
1. hO and hP are different. This must mean that O and P are different, because using the same hash on 2 equal values/objects must yield the same value. Hashes are deterministic. There are no false negatives.
2. hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well, there was a 1 in 2^128 chance of collision (ideal case), so we can assume O = P." This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hashes came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
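The two cases in this answer can be sketched roughly as follows. This is an illustration under assumptions: SHA-256 stands in for whatever hash is actually used, and the names `digest` and `matches` are hypothetical.

```python
import hashlib
from typing import Optional


def digest(data: bytes) -> str:
    """Hash used for the comparison (SHA-256 here, as an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def matches(h_o: str, p: bytes, original: Optional[bytes] = None) -> bool:
    """Decide whether P equals O, given O's hash and optionally O itself."""
    if digest(p) != h_o:
        # Case 1: different hashes prove the objects differ (no false negatives).
        return False
    if original is None:
        # Case 2a: hashes match and O is unavailable, so trust the hash.
        return True
    # Case 2b: hashes match, rule out a false positive by comparing directly.
    return original == p
```

Passing `original` only makes sense when O is actually at hand, which is the situation where the answer notes that hashing may be a superfluous step.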
Cryptographic one-way hash functions are, by nature of definition, not injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which determine their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute force attack (finding a pre-image for a given hash image). We may care about how hard it is to find a collision: finding two pre-images that have the same hash image, to be used in a birthday attack.
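The first of those attributes, how many image bits change when a single pre-image bit changes, is easy to measure. The sketch below does so for MD5; the helper names are illustrative, and the exact count depends on the input, so no particular value is claimed.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def avalanche(data: bytes, bit_index: int = 0) -> int:
    """Number of MD5 output bits (out of 128) changed by flipping one input bit."""
    original = hashlib.md5(data).digest()
    changed = hashlib.md5(flip_bit(data, bit_index)).digest()
    return differing_bits(original, changed)


if __name__ == "__main__":
    print(avalanche(b"hello world"), "of 128 bits changed")
```

For a hash with good diffusion, one would expect roughly half of the 128 output bits to change on average.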
While collisions must exist whenever the values to be hashed are longer than the resulting hash, the chance of hitting one is still sufficiently low for most purposes (there are 2^128 possible hashes in total, so the chance of two random strings producing the same hash is theoretically about 1 in 10^38).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always add a salt if you are using MD5 to store password hashes. If the salt is part of the hashed input, a matching input string has to include exactly the same salt in order to produce the same output. Otherwise, a raw input string that happens to match the stored hash will fail to match once the system salts it automatically (i.e. you can't just "reverse" the MD5 and use the result to log in, because the reversed value will most likely not be the salted string that originally produced the hash).
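A minimal sketch of salted password storage follows. Note the substitution: instead of a bare salted MD5, it uses `hashlib.pbkdf2_hmac` with SHA-256 from the Python standard library, which applies the salt and repeats the hash many times. The function names, salt length, and iteration count are assumptions chosen for illustration, not recommendations taken from this answer.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = 100_000) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key); a fresh random salt is made if none is given."""
    if salt is None:
        salt = os.urandom(16)  # per-password random salt (assumed length)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                  salt, iterations)
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 100_000) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    _, derived = hash_password(password, salt, iterations)
    return hmac.compare_digest(derived, expected)
```

The salt is stored next to the derived key. Because it is mixed into every hash, the same password yields a different stored value for each user.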
So hashes are not unique, but the authentication mechanism can make them sufficiently unique. This is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that produce the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force. Salts are obviously still a good idea nevertheless.
Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically everyone else) said, it's not perfect, but it does the job, and collision performance really depends on the algorithm (which is actually pretty good).
What is of real interest is automatic manipulation of files or data to keep the same hash with different data, see this Demo