Why are MD5 hash values not reversible?
One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:
If on my server, in PHP I produce:
md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"
When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.
Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?
What is it about these functions that makes the resulting strings impossible to retrace?
Comments (16)
The input can be arbitrarily long, whereas the output is always 128 bits long. This means that an infinite number of input strings will generate the same output.
If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?
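The remainder example can be tried out directly. This is only a small Python sketch; the name parity_hash is made up for illustration. Many different inputs share each output, so the output alone cannot tell you which one was used:

```python
def parity_hash(n: int) -> int:
    """A deliberately tiny 'hash': keep only the remainder modulo 2."""
    return n % 2

# Every even number collapses to 0 and every odd number to 1.
inputs = [4, 10, 123456, 7, 99, 2**200 + 1]
for n in inputs:
    print(n, "->", parity_hash(n))

# All inputs below 20 that map to the output 1: there is no single "original".
preimages_of_one = [n for n in range(20) if parity_hash(n) == 1]
print(preimages_of_one)
```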
If hash functions such as MD5 were reversible, it would be a watershed event in the history of data compression algorithms! It's easy to see that if MD5 were reversible, then arbitrary chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. You would be able to reconstruct the original message from a 128-bit number regardless of the size of the original message.
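The counting argument behind this can be seen at a smaller scale. The sketch below is an illustration only: it assumes a made-up helper, tiny_digest, that keeps just the first byte of an MD5 digest, so there are only 256 possible outputs. Feeding it 257 different inputs must produce at least one repeated output, which is why the output cannot be "decompressed" back to a unique input:

```python
import hashlib

def tiny_digest(data: bytes) -> int:
    """First byte of the MD5 digest: a 'hash' with only 256 possible outputs."""
    return hashlib.md5(data).digest()[0]

def find_collision():
    seen = {}  # tiny digest -> first input that produced it
    for i in range(257):  # 257 inputs, but only 256 possible outputs
        data = str(i).encode()
        d = tiny_digest(data)
        if d in seen:
            return seen[d], data
        seen[d] = data
    raise AssertionError("unreachable: 257 inputs cannot fit into 256 outputs")

first, second = find_collision()
print(first, second, "both map to", tiny_digest(first))
```

The same reasoning applies to the full 128-bit output; the numbers are just too large to enumerate.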
Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.
Consider a function simple_hash (in PHP terms, as in the question) that appends some spaces if the string is too short, then takes the first 16 bytes of the string and encodes them as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).
This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: We can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex-digits. Of course it can't be injective.
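The function described above can be sketched in Python as follows. This is a reconstruction from the prose description, not the answer's original PHP listing; invert_simple_hash is a hypothetical helper that plays the role of hex2bin and shows how easily this particular "hash" is undone:

```python
import hashlib

def simple_hash(text: str) -> str:
    """Pad with spaces to at least 16 bytes, keep the first 16, hex-encode."""
    data = text.encode("utf-8")
    return data.ljust(16, b" ")[:16].hex()

def invert_simple_hash(digest: str) -> str:
    """Hex-decoding gives back a string that hashes to the same value."""
    return bytes.fromhex(digest).decode("utf-8", errors="replace")

h = simple_hash("stackoverflow.co")
print(h)                      # 32 hex digits, like an MD5 hash
print(invert_simple_hash(h))  # the 16-byte input comes straight back

# Same output size as MD5, and just as non-injective:
print(len(h) == len(hashlib.md5(b"stackoverflow.co").hexdigest()))
print(simple_hash("stackoverflow.com") == h)
```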
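But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin to your hash, and you have it). If your original string had length 16 (as in our example), you will even get the original string back. Nothing of this kind should be possible for MD5, even if you know that the input was quite short (other than by trying all possible inputs until one matches, i.e. a brute-force attack).
The important assumptions for a cryptographic hash function are that it is hard to find a string producing a given hash (preimage resistance), hard to find a different string producing the same hash as a given string (second-preimage resistance), and hard to find any pair of different strings with the same hash (collision resistance).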
Obviously my simple_hash function fulfills none of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provably second-preimage resistant and collision resistant.)
There now exist collision attacks against MD5 (e.g. it is possible to produce a pair of strings with the same hash, even with a given common prefix, with quite some work but not an infeasible amount of it), so you shouldn't use MD5 for anything critical.
There is not yet a preimage attack, but attacks will get better.
To answer the actual question:
What MD5 (and other hash functions built on the Merkle-Damgard construction) effectively does is apply an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split into blocks; each of these blocks is used as the key to encrypt the output of the previous block, and the result is XORed with that input to prevent reverse calculations.)
Modern encryption algorithms (including the ones used in hash functions) are made in a way to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them).
They do this generally by doing lots of bit-shuffling operations in a way that each output bit is determined by each key bit (several times) and also each input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.
For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you only have input and output of your encryption function, but not the key (this is what you are looking for).
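To make that structure concrete, here is a toy sketch of the construction. Everything in it is an assumption for illustration: the 32-bit state, the 8-byte blocks, the little Feistel cipher (toy_encrypt / toy_decrypt) and the constants are invented and far too weak for real use, and MD5's actual compression function is different (64 steps on a 128-bit state, with modular addition for the feed-forward). It only shows the shape: each message block acts as the key, the chaining value is the "plain text", and the result is XORed with the input.

```python
import struct

MASK16 = 0xFFFF
IV = 0x67452301  # arbitrary fixed starting value for this toy

def _round_keys(block: bytes):
    # Split the 8-byte message block into four 16-bit round keys, used twice.
    return list(struct.unpack(">4H", block)) * 2

def _f(half: int, k: int) -> int:
    # Arbitrary mixing function; a Feistel network is invertible for any choice.
    return ((half * 0x9E37 + k) ^ (half >> 3)) & MASK16

def toy_encrypt(block: bytes, state: int) -> int:
    left, right = state >> 16, state & MASK16
    for k in _round_keys(block):
        left, right = right, left ^ _f(right, k)
    return (left << 16) | right

def toy_decrypt(block: bytes, state: int) -> int:
    # Only possible because we know the key (the message block).
    left, right = state >> 16, state & MASK16
    for k in reversed(_round_keys(block)):
        left, right = right ^ _f(left, k), left
    return (left << 16) | right

def pad(message: bytes) -> bytes:
    # 0x80 marker, zero fill to a block boundary, then the length in its own block.
    padded = message + b"\x80"
    padded += b"\x00" * (-len(padded) % 8)
    return padded + struct.pack(">Q", len(message))

def toy_hash(message: bytes) -> str:
    h = IV
    data = pad(message)
    for i in range(0, len(data), 8):
        block = data[i:i + 8]
        h = toy_encrypt(block, h) ^ h  # block is the key; XOR with the input
    return f"{h:08x}"

print(toy_hash(b"stackoverflow.com"))
# With the key, the cipher itself is easy to run backwards:
print(hex(toy_decrypt(b"12345678", toy_encrypt(b"12345678", 0xDEADBEEF))))
```

Someone attacking the hash is in the opposite position: they see the fixed starting value and the final output, and the thing they are missing is the key, i.e. the message.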
Cody Brocious's answer is the right one. Strictly speaking, you cannot "invert" a hash function because many strings are mapped to the same hash. Notice, however, that either finding one string that gets mapped to a given hash, or finding two strings that get mapped to the same hash (i.e. a collision), would be major breakthroughs for a cryptanalyst. The great difficulty of both these problems is the reason why good hash functions are useful in cryptography.
MD5 does not create a unique hash value; the goal of MD5 is to quickly produce a value that changes significantly based on a minor change to the source.
Most hashes (if not all) are also non-unique; rather, they're unique enough, so a collision is highly improbable, but still possible.
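The "changes significantly based on a minor change" behaviour can be observed with Python's standard hashlib module. This is just a quick sketch; differing_bits is a helper name made up here, and the inputs are arbitrary:

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions in which the two MD5 digests differ."""
    da = int.from_bytes(hashlib.md5(a).digest(), "big")
    db = int.from_bytes(hashlib.md5(b).digest(), "big")
    return bin(da ^ db).count("1")

# The inputs differ only in the case of one letter.
print(hashlib.md5(b"hello").hexdigest())
print(hashlib.md5(b"Hello").hexdigest())
print(differing_bits(b"hello", b"Hello"), "of 128 bits differ")
```

For a well-mixed hash you would expect roughly half of the 128 bits to flip.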
A good way to think of a hash algorithm is to think of resizing an image in Photoshop. Say you have an image that is 5000x5000 pixels and you resize it to just 32x32. What you have is still a representation of the original image, but it is much, much smaller and has effectively "thrown away" certain parts of the image data to make it fit in the smaller size. So if you were to resize that 32x32 image back up to 5000x5000, all you'd get is a blurry mess. However, because a 32x32 image is not that large, it is theoretically conceivable that another image could be downsized to produce the exact same pixels!
That's just an analogy but it helps understand what a hash is doing.
A hash collision is much more likely than you would think. Take a look at the birthday paradox to get a greater understanding of why that is.
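For a rough sense of the numbers, the usual birthday approximation says that with k random samples drawn from N possible values, the chance of at least one collision is about 1 - exp(-k(k-1)/(2N)). A small sketch (collision_probability is a name chosen here, and the formula is an approximation, not an exact count):

```python
import math

def collision_probability(samples: int, outputs: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-k(k-1) / (2N))."""
    exponent = samples * (samples - 1) / (2 * outputs)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate for tiny x too

# The classic case: 23 people, 365 possible birthdays.
print(collision_probability(23, 365))
# A 128-bit hash: around 2**64 random inputs already give a sizeable chance.
print(collision_probability(2**64, 2**128))
# A more realistic number of inputs: a trillion hashes.
print(collision_probability(10**12, 2**128))
```

So collisions in a 128-bit hash become likely after on the order of 2**64 random inputs rather than 2**128, which is the sense in which they are "more likely than you would think".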
As the number of possible input files is larger than the number of 128-bit outputs, it's impossible to uniquely assign an MD5 hash to each possible file.
Cryptographic hash functions are used for checking data integrity or digital signatures (the hash being signed for efficiency). Changing the original document should therefore mean the original hash doesn't match the altered document.
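As a minimal sketch of the integrity-check use case, assuming the expected hash was published alongside the document (the document text and the helper names md5_hex and is_unmodified are made up for the example):

```python
import hashlib
import hmac

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def is_unmodified(data: bytes, expected_hex: str) -> bool:
    # compare_digest avoids leaking where the first mismatch occurs.
    return hmac.compare_digest(md5_hex(data), expected_hex)

original = b"Pay Alice 10 dollars"
published_hash = md5_hex(original)  # distributed together with the document

tampered = b"Pay Alice 90 dollars"
print(is_unmodified(original, published_hash))  # hash still matches
print(is_unmodified(tampered, published_hash))  # any change breaks the match
```

MD5 is used here only to stay with the topic of the question; hashlib.sha256 is a drop-in replacement in this sketch.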
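These criteria are sometimes used:
1. Preimage resistance: given a hash, it should be hard to find any input that produces it.
2. Second-preimage resistance: given an input, it should be hard to find a different input with the same hash.
3. Collision resistance: it should be hard to find any two different inputs with the same hash.
These criteria are chosen to make it difficult to find a document that matches a given hash; otherwise it would be possible to forge documents by replacing the original with one that has the same hash. (Even if the replacement is gibberish, the mere replacement of the original may cause disruption.)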
Number 3 implies number 2.
As for MD5 in particular, it has been shown to be flawed: see "How to break MD5 and other hash functions".
But this is where rainbow tables come into play.
Basically, a large number of values are hashed separately and the results are saved to disk. The reversing part is then "just" a lookup in a very large table.
Obviously this is only feasible for a subset of all possible input values, but if you know the bounds of the input value, it might be possible to compute it.
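The idea as described, a precomputed table for a bounded input space, might look like the sketch below. It assumes the inputs are known to be 4-digit PINs, and build_table / reverse_lookup are names invented for the example. Note that this is a plain lookup table; real rainbow tables store hash chains to trade lookup time for disk space, which is not shown here.

```python
import hashlib

def build_table(candidates):
    """Map each candidate's MD5 hex digest back to the candidate itself."""
    return {hashlib.md5(c.encode()).hexdigest(): c for c in candidates}

# Assumption: the input is known to be a 4-digit PIN, so only 10,000 candidates.
table = build_table(f"{n:04d}" for n in range(10_000))

def reverse_lookup(digest: str):
    # Returns None for any hash whose input lies outside the precomputed set.
    return table.get(digest)

target = hashlib.md5(b"0042").hexdigest()
print(reverse_lookup(target))
print(reverse_lookup(hashlib.md5(b"not a pin").hexdigest()))
```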
Chinese scientists found a way to make two different strings produce the same MD5 hash, and later work extended this to so-called "chosen-prefix collisions".
Here is an example: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5.exe.zip
The source code: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5_source.zip
The best way to understand what the most-voted answers mean is to actually try to reverse the MD5 algorithm. I remember trying to reverse the MD5crypt algorithm some years ago, not to recover the original message, because that is clearly impossible, but just to generate a message that would produce the same hash as the original. This, at least theoretically, would have given me a way to log in to a Linux device that stored user:password in the /etc/passwd file, using the generated message (password) instead of the original one. Since both messages would have the same resulting hash, the system would recognize my password (generated from the original hash) as valid. That didn't work at all. After several weeks, if I remember correctly, the use of salt in the initial message killed me. I had to produce not only a valid initial message, but a salted valid initial message, which I was never able to do. But the knowledge that I got from this experiment was nice.
As most have already said, MD5 was designed to hash variable-length data streams to a fixed-length chunk of data, so a single hash is shared by many input data streams.
However, if you ever do need to find the original data from the checksum, for example if you have the hash of a password and need to find the original password, it's often quicker to just google the hash (or use whatever search engine you prefer) than to brute-force it. I have successfully found a few passwords using this method.
Nowadays MD5 hashes, or any other hashes for that matter, are precomputed for large numbers of common strings and stored for easy access. Though in theory MD5 is not reversible, using such databases you may find out which text resulted in a particular hash value.
For example, try the following hash code at http://gdataonline.com/seekhash.php to find out what text I used to compute the hash
f(x) = 1 is irreversible. Hash functions aren't irreversible.
This is actually required for them to fulfill their function of determining whether someone possesses an uncorrupted copy of the hashed data. This brings susceptibility to brute force attacks, which are quite powerful these days, particularly against MD5.
There's also confusion here and elsewhere among people who have mathematical knowledge but little cipherbreaking knowledge. Several ciphers simply XOR the data with the keystream, and so you could say that a ciphertext corresponds to all plaintexts of that length because you could have used any keystream.
However, this ignores that a reasonable plaintext produced from the seed "password" is much, much more likely than another produced by the seed "Wsg5Nm^bkI4EgxUOhpAjTmTjO0F!VkWvysS6EEMsIJiTZcvsh@WI$IH$TYqiWvK!%&Ue&nk55ak%BX%9!NnG%32ftud%YkBO$U6o", to the extent that anyone claiming that the second was a possibility would be laughed at.
In the same way, if you're trying to decide between the two potential passwords "password" and "Wsg5Nm^bkI4EgxUO", it's not as difficult to do as some mathematicians would have you believe.
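A guessing attack along these lines can be sketched as follows: try likely candidates first, then fall back to exhaustively enumerating short strings. The word list COMMON, the lowercase-only alphabet and the length limit are assumptions picked to keep the example small; real attacks use far larger lists and search spaces.

```python
import hashlib
from itertools import product
from string import ascii_lowercase

# Tiny stand-in for a real list of common passwords.
COMMON = ["123456", "password", "qwerty", "letmein"]

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()

def crack(target_hex: str, max_len: int = 3):
    # 1. Likely candidates first.
    for guess in COMMON:
        if md5_hex(guess) == target_hex:
            return guess
    # 2. Then brute force over all short lowercase strings.
    for length in range(1, max_len + 1):
        for chars in product(ascii_lowercase, repeat=length):
            guess = "".join(chars)
            if md5_hex(guess) == target_hex:
                return guess
    return None  # outside the searched space

print(crack("5f4dcc3b5aa765d61d8327deb882cf99"))  # found via the word list
print(crack("900150983cd24fb0d6963f7d28e17f72"))  # found by brute force
print(crack(md5_hex("Wsg5Nm^bkI4EgxUO")))         # not found in this search
```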
By definition, a cryptographic hash function should not be invertible and should have as few collisions as possible.
Regarding your question: it is a one-way hash. The input (irrespective of length) is padded according to the algorithm (to a 512-bit boundary for MD5) and generates a fixed-size output. The information is compressed (lost), and it is practically impossible to regenerate it with a reverse transform.
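The padding step mentioned above can be sketched like this. It follows the scheme MD5 uses as I understand it from RFC 1321 (a 0x80 byte, zero bytes up to 56 mod 64, then the message length in bits as a 64-bit little-endian integer); md5_pad is just a name chosen for the sketch and is not part of any library:

```python
import struct

def md5_pad(message: bytes) -> bytes:
    """Pad a message to a multiple of 64 bytes (512 bits), MD5-style."""
    bit_length = (8 * len(message)) % 2**64
    padded = message + b"\x80"                      # a single 1 bit, then zeros
    padded += b"\x00" * ((56 - len(padded)) % 64)   # leave room for the length
    return padded + struct.pack("<Q", bit_length)   # 64-bit little-endian length

for n in (0, 1, 55, 56, 64, 1000):
    print(n, "bytes ->", len(md5_pad(b"a" * n)), "bytes after padding")
```

However long the padded input is, it is then processed block by block into the same 128-bit state, which is where the information is lost.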
Additional info on MD5: it is vulnerable to collisions. I went through this article recently: http://www.win.tue.nl/hashclash/Nostradamus/
Open source code for crypto hash implementations (MD5 and SHA) can be found in the Mozilla code (freebl library).
I like all the various arguments.
It is obvious the real value of hashed values is simply to provide human-unreadable placeholders for strings such as passwords.
It has no specific enhanced security benefit.
Assuming an attacker gained access to a table with hashed passwords, he/she can:
In this case weak passwords cannot be protected by the mere fact that they are hashed.