What is the shortest pair of strings that produces an MD5 collision?

Published 2024-08-16 22:16:44


Up to what string length is it possible to use MD5 as a hash without having to worry about the possibility of a collision?

This would presumably be calculated by generating an MD5 hash for every possible string in a particular character set, in increasing length, until a hash appears for a second time (a collision). The maximum possible length of a string without a collision would then be one character less than the longest of the colliding pair.

Has this already been tested for MD5, SHA1, etc?
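The exhaustive search described above can be sketched in Python. A full 128-bit MD5 search is computationally infeasible, so this sketch truncates the digest (to 16 bits here) purely so that a collision actually shows up in seconds; the function name, alphabet, and truncation width are illustrative assumptions, not anything from the question.

```python
from hashlib import md5
from itertools import count, product

def first_collision(alphabet=b'abcdefghijklmnopqrstuvwxyz', bits=16):
    """Enumerate strings by increasing length until two of them share a
    (truncated) MD5 digest, and return that colliding pair."""
    seen = {}
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            s = bytes(chars)
            h = md5(s).digest()[:bits // 8]  # keep only the first `bits` bits
            if h in seen:
                return seen[h], s
            seen[h] = s

a, b = first_collision()
assert a != b
assert md5(a).digest()[:2] == md5(b).digest()[:2]
```

With the full 128-bit digest, the same loop would need on the order of 2^64 hashes before a collision becomes likely (the birthday bound), which is why real MD5 collisions are found analytically rather than by enumeration.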


Comments (3)

芯好空 2024-08-23 22:16:44


Update

Ironically, a few weeks after I posted the previous answer, two Chinese researchers, Tao Xie and Dengguo Feng, published a new single-block collision for MD5. I was unaware of that paper until now. A single MD5 block means that the input size is 64 bytes or 512 bits. Note that the inputs are mostly the same, differing only in 2 bits.

Their methodology won't be published until January 2013, but their collision can be verified now, using numbers from the paper:

>>> from array import array
>>> from hashlib import md5
>>> input1 = array('I',  [0x6165300e,0x87a79a55,0xf7c60bd0,0x34febd0b,0x6503cf04,
    0x854f709e,0xfb0fc034,0x874c9c65,0x2f94cc40,0x15a12deb,0x5c15f4a3,0x490786bb,
    0x6d658673,0xa4341f7d,0x8fd75920,0xefd18d5a])
>>> input2 = array('I', [x^y for x,y in zip(input1,
    [0, 0, 0, 0, 0, 1<<10, 0, 0, 0, 0, 1<<31, 0, 0, 0, 0, 0])])
>>> input1 == input2
False
>>> md5(input1).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'
>>> md5(input2).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'
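The 2-bit difference claimed above is easy to check directly. This just re-uses the numbers from the paper, counting the set bits in the XOR of the two blocks:

```python
from array import array
from hashlib import md5

# The two colliding 64-byte blocks from the Xie/Feng paper.
input1 = array('I', [0x6165300e, 0x87a79a55, 0xf7c60bd0, 0x34febd0b,
                     0x6503cf04, 0x854f709e, 0xfb0fc034, 0x874c9c65,
                     0x2f94cc40, 0x15a12deb, 0x5c15f4a3, 0x490786bb,
                     0x6d658673, 0xa4341f7d, 0x8fd75920, 0xefd18d5a])
mask = [0, 0, 0, 0, 0, 1 << 10, 0, 0, 0, 0, 1 << 31, 0, 0, 0, 0, 0]
input2 = array('I', [x ^ y for x, y in zip(input1, mask)])

# Exactly two bits differ between the blocks...
diff_bits = sum(bin(x ^ y).count('1') for x, y in zip(input1, input2))
assert diff_bits == 2

# ...yet they hash to the same digest ('cee9a457e790cf20d4bdaa6d69f01e41'
# in the session above, on a little-endian machine).
assert md5(input1.tobytes()).hexdigest() == md5(input2.tobytes()).hexdigest()
```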

Update: The paper was published in March 2013: Tao Xie, Fanbao Liu and Dengguo Feng, Fast Collision Attack on MD5.

However, if you have more room to play with, collisions of a few kilobytes are MUCH faster to calculate -- they can be calculated within hours on ANY regular computer.

Old answer

The previous shortest collision used at least two MD5 blocks worth of input -- that's 128 bytes, 1024 bits. A prefix in the first block can be chosen arbitrarily by the attacker, the rest would be computed and appear as gibberish.

Here's an example of two different colliding inputs; you can try it yourself in Python:

>>> from binascii import unhexlify
>>> from hashlib import md5
>>> input1 = b'Oded Goldreich\nOded Goldreich\nOded Goldreich\nOded Go' + unhexlify(
... 'd8050d0019bb9318924caa96dce35cb835b349e144e98c50c22cf461244a4064bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf67c12978647f5af123de3acf844085cd025b956')
>>> len(input1)
128
>>> md5(input1).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'
>>> input2 = b'Neal Koblitz\nNeal Koblitz\nNeal Koblitz\nNeal Koblitz\n' + unhexlify(
... '75b80e0035f3d2c909af1baddce35cb835b349e144e88c50c22cf461244a40e4bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf6fc11978647f5af123de3acf84408dcd025b956')
>>> md5(input2).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'

Generating these two particular inputs took 2 days on a 215-node PlayStation 3 cluster, by Marc Stevens :)

梦毁影碎の 2024-08-23 22:16:44


The mathematics of the birthday paradox put the inflection point of the collision probability roughly at sqrt(N), where N is the number of distinct bins in the hash function; for a 128-bit hash, once you have hashed around 2^64 inputs you are moderately likely to have a collision. So my guess is that for the complete set of 8-byte strings a collision is somewhat likely, and for 9-byte strings it is extremely likely.

Edit: this assumes that the MD5 hash algorithm maps input bytestrings to output hashes in a way that is close to "random" (versus one that distributes strings more evenly among the set of possible hashes, in which case the threshold would be closer to 16-byte strings).

Also for a more specific numerical answer, if you look at one of the approximations for calculating collision probability, you get

p(k) ≈ 1 − e^(−k(k−1)/(2·2^128)), where k = the size of the space of possible inputs = 2^m when the input bytestring is m bits long.

the set of 8-byte strings: p(2^64) ≈ 1 − e^(−0.5) ≈ 0.3935

the set of 9-byte strings: p(2^72) ≈ 1 − e^(−2^144/(2·2^128)) = 1 − e^(−2^15) = 1 − e^(−32768) ≈ 1

Also note that these assume the complete set of m/8 byte strings. If you only use alphanumeric characters, you'd need more bytes to get a probable collision.
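The arithmetic above is easy to sanity-check numerically. This evaluates the same birthday-bound approximation in Python, assuming nothing beyond the hash behaving like a random function:

```python
from math import exp

# Birthday-bound approximation from the answer above:
# p(k) ~= 1 - e^(-k(k-1)/(2N)), with N = 2^128 possible MD5 digests.
N = 2 ** 128

def collision_probability(k):
    return 1 - exp(-k * (k - 1) / (2 * N))

p8 = collision_probability(2 ** 64)   # all 8-byte strings
p9 = collision_probability(2 ** 72)   # all 9-byte strings
print(p8)  # ~0.3935
print(p9)  # indistinguishable from 1.0 in floating point
```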

沉溺在你眼里的海 2024-08-23 22:16:44


I doubt there is any useful length at which you can rule out collisions. These algorithms aren't really used for that purpose: they are meant to be unique under slight changes in the data (such as corrupted files), rather than unique over all possible sets of data.
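The corruption-detection point can be illustrated in a few lines (a toy example of my own, not from the answer): flipping a single input bit yields a completely different digest, which is exactly the property being described.

```python
from hashlib import md5

data = b'contents of some file'
corrupted = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit

# A one-bit change in the input produces an unrelated-looking digest.
d1 = md5(data).hexdigest()
d2 = md5(corrupted).hexdigest()
assert d1 != d2
```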
