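Can two different strings generate the same MD5 hash code?

Posted on 2024-08-11 19:46:44

We generate an MD5 hash for each of our binary assets. It is used to check whether a given binary asset is already in our application. But is it possible for two different binary assets to generate the same MD5 hash? In other words, is it possible for two different strings to generate the same MD5 hash?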

滥情稳全场 2024-08-18 19:46:44

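For a set of even billions of assets, the chance of a random collision is negligibly small, and nothing you should worry about. Considering the birthday paradox, given a set of 2^64 (or 18,446,744,073,709,551,616) assets, the probability of a single MD5 collision within this set is on the order of 50%. At that scale, you would probably beat Google in terms of storage capacity.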
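The birthday estimate can be checked numerically. The following is a minimal sketch that assumes MD5 outputs behave like uniformly random 128-bit values (true only for non-adversarial inputs) and uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)); the function name `collision_probability` is made up for this illustration.

```python
import math

MD5_BITS = 128  # size of an MD5 digest in bits


def collision_probability(n_items: int, hash_bits: int = MD5_BITS) -> float:
    """Approximate chance that at least two of n_items random hashes collide."""
    space = 2 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the exponent is extremely small.
    exponent = n_items * (n_items - 1) / (2 * space)
    return -math.expm1(-exponent)


for n in (10**9, 10**12, 2**64):
    print(f"{n:>26,d} assets -> p = {collision_probability(n):.3e}")
```

Under this approximation a billion assets give a probability far below 1e-20, and exactly 2^64 assets give about 0.39, which is the same order of magnitude as the 50% figure quoted above.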

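However, because the MD5 hash function has been broken (it is vulnerable to collision attacks), any determined attacker can produce 2 colliding assets with a matter of seconds' worth of CPU power. So if you want to use MD5, make sure that such an attacker cannot compromise the security of your application!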

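Also, consider the ramifications if an attacker could forge a collision with an existing asset in your database. While no such attacks (preimage attacks) against MD5 are known (as of 2011), they could become possible by extending the current research on collision attacks.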

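If these turn out to be a problem, I suggest looking at the SHA-2 family of hash functions (SHA-256, SHA-384 and SHA-512). The downside is that they are slightly slower and have longer hash output.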

匿名的好友 2024-08-18 19:46:44

MD5 is a hash function, so yes, two different strings can absolutely generate colliding MD5 codes.

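In particular, note that MD5 codes have a fixed length, so the number of possible MD5 codes is limited. The number of strings (of any length), however, is unlimited, so it logically follows that there must be collisions.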
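The counting argument is easier to see on a deliberately small hash space. The sketch below is an illustration only: it keeps just the first 2 bytes of each MD5 digest (65,536 possible values), so a collision is guaranteed within the first 65,537 inputs. The helper names `truncated_md5` and `find_collision` are invented for this example.

```python
import hashlib
from itertools import count


def truncated_md5(data: bytes, n_bytes: int = 2) -> bytes:
    """First n_bytes of the MD5 digest (a deliberately tiny hash space)."""
    return hashlib.md5(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2) -> tuple[bytes, bytes]:
    """Hash b"0", b"1", b"2", ... until two inputs share a truncated digest."""
    seen: dict[bytes, bytes] = {}
    for i in count():
        msg = str(i).encode()
        digest = truncated_md5(msg, n_bytes)
        if digest in seen:
            return seen[digest], msg
        seen[digest] = msg


a, b = find_collision()
print(a, b, truncated_md5(a).hex())
```

The full 128-bit MD5 works the same way in principle; the space is just far too large for this kind of brute-force search.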

亢潮 2024-08-18 19:46:44

Yes, it is possible that two different strings can generate the same MD5 hash code.

Here is a simple test using two very similar binary messages, given as hex strings:

$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c6b384c4968b28812b676b49d40c09f8af4ed4cc  -
008ee33a9d58b51cfeb425b0959121c9

$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c728d8d93091e9c7b87b43d9e33829379231d7ca  -
008ee33a9d58b51cfeb425b0959121c9

They generate different SHA-1 sums, but the same MD5 hash value. The strings are also very similar, so it is difficult to spot the difference between them.
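The same check can be sketched in Python with the standard `hashlib` module, as an alternative for systems where the `md5` or `xxd` commands are not available. The two hex strings are copied from the commands above; the names `MSG_A` and `MSG_B` are just labels for this example.

```python
import hashlib

# The two 64-byte messages from the shell commands above, as hex.
MSG_A = bytes.fromhex(
    "4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518"
    "afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2"
)
MSG_B = bytes.fromhex(
    "4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518"
    "afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2"
)

for name, msg in (("A", MSG_A), ("B", MSG_B)):
    print(name, "md5 ", hashlib.md5(msg).hexdigest())
    print(name, "sha1", hashlib.sha1(msg).hexdigest())

# Byte offsets (0-based) at which the two messages differ.
print([i for i, (x, y) in enumerate(zip(MSG_A, MSG_B)) if x != y])
```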

The difference can be found by the following command:

$ diff -u <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2 | fold -w2) <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2 | fold -w2)
--- /dev/fd/63  2016-02-05 12:55:04.000000000 +0000
+++ /dev/fd/62  2016-02-05 12:55:04.000000000 +0000
@@ -33,7 +33,7 @@
 af
 bf
 a2
-00
+02
 a8
 28
 4b
@@ -53,7 +53,7 @@
 6d
 a0
 d1
-55
+d5
 5d
 83
 60

The collision example above is taken from Marc Stevens: Single-block collision for MD5, 2012; he explains his method, with source code (alternate link to the paper).

Another test:

$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
756f3044edf52611a51a8fa7ec8f95e273f21f82  -
cee9a457e790cf20d4bdaa6d69f01e41

$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
6d5294e385f50c12745a4d901285ddbffd3842cb  -
cee9a457e790cf20d4bdaa6d69f01e41

Different SHA-1 sums, the same MD5 hash.

The messages differ in only two bytes:

$ diff -u <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2) <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2)
--- /dev/fd/63  2016-02-05 12:56:43.000000000 +0000
+++ /dev/fd/62  2016-02-05 12:56:43.000000000 +0000
@@ -19,7 +19,7 @@
 03
 65
 9e
-70
+74
 4f
 85
 34
@@ -41,7 +41,7 @@
 a3
 f4
 15
-5c
+dc
 bb
 86
 07

The example above is adapted from Tao Xie and Dengguo Feng: Construct MD5 Collisions Using Just A Single Block Of Message, 2010.

Related: Are there two known strings which have the same MD5 hash value? at Crypto.SE

家住魔仙堡 2024-08-18 19:46:44

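Yes, it is possible. This is in fact a birthday problem. However, the probability of two randomly chosen strings having the same MD5 hash is very low.

See this and this question for examples.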

北斗星光 2024-08-18 19:46:44

Yes, of course: MD5 hashes have a finite length, but there is an infinite number of possible strings that can be MD5-hashed.

晌融 2024-08-18 19:46:44

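Yes, it is possible. This is called a hash collision.

Having said that, algorithms such as MD5 are designed to minimize the probability of a collision.

The Wikipedia entry on MD5 explains some vulnerabilities in MD5 that you should be aware of.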

时光无声 2024-08-18 19:46:44

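Yes! Collisions are a possibility (although the risk is very small). If they were not, you would have a pretty effective compression method!

EDIT: As Konrad Rudolph says: a potentially unlimited set of inputs converted to a finite set of outputs (32 hex characters) will result in an endless number of collisions.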

毅然前行 2024-08-18 19:46:44

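Just to be more informative.
From a math point of view, hash functions are not injective.
This means that there is not a 1-to-1 (but a one-way) relationship between the starting set and the resulting one.

Bijection on Wikipedia

EDIT: to be complete, injective hash functions do exist: this is called perfect hashing.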

青丝拂面 2024-08-18 19:46:44

I think we need to be careful to choose a hashing algorithm that fits our requirements, as hash collisions are not as rare as I expected. I recently found a very simple case of a hash collision in my project. I am using a Python wrapper of xxhash for hashing. Link: https://github.com/ewencp/pyhashxx

s1 = 'mdsAnalysisResult105588'
s2 = 'mdsAlertCompleteResult360224'
pyhashxx.hashxx(s1) # Out: 2535747266
pyhashxx.hashxx(s2) # Out: 2535747266

It caused a very tricky caching issue in the system, until I finally found out that it was a hash collision.
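The hash values shown above fit in 32 bits, which goes some way to explaining why a collision turned up so quickly. As a rough sketch, assuming a hash with uniformly distributed output, the number of keys needed for a given collision probability can be estimated by inverting the birthday approximation; `items_for_collision_probability` is a made-up helper name.

```python
import math


def items_for_collision_probability(hash_bits: int, p: float = 0.5) -> int:
    """Approximate number of random hashes needed to reach collision probability p."""
    space = 2 ** hash_bits
    # Invert p ~= 1 - exp(-n^2 / (2N))  =>  n ~= sqrt(2N * ln(1 / (1 - p)))
    return math.ceil(math.sqrt(2 * space * math.log(1 / (1 - p))))


for bits in (32, 64, 128):
    n = items_for_collision_probability(bits)
    print(f"{bits:>3}-bit hash: about {n:,} items for a 50% collision chance")
```

For a 32-bit hash this comes out at under 80,000 keys, which a busy cache can easily reach, whereas a 128-bit hash such as MD5 needs more than 2^64 random inputs.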

原谅我要高飞 2024-08-18 19:46:44

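As other people have said, yes, there can be collisions between two different inputs. However, in your use case, I don't see that being a problem. I highly doubt that you will run into collisions: at a previous job I used MD5 to fingerprint hundreds of thousands of image files in a number of formats (JPG, bitmap, PNG, raw), and I never had a collision.

However, if you are trying to fingerprint some kind of data, perhaps you could use two hash algorithms. The likelihood of two different inputs producing the same output under both algorithms is near zero.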
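A minimal sketch of the two-algorithm idea, assuming MD5 plus SHA-256 as the pair (the answer does not say which algorithms to combine) and using a hypothetical helper named `fingerprint`. It is applied here to the single-block MD5 collision pair quoted elsewhere in this thread.

```python
import hashlib


def fingerprint(data: bytes) -> tuple[str, str]:
    """Return (MD5, SHA-256) hex digests; two assets match only if both agree."""
    return hashlib.md5(data).hexdigest(), hashlib.sha256(data).hexdigest()


# Known MD5 collision pair (Marc Stevens, 2012), as quoted in this thread.
MSG_A = bytes.fromhex(
    "4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518"
    "afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2"
)
MSG_B = bytes.fromhex(
    "4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518"
    "afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2"
)

fp_a, fp_b = fingerprint(MSG_A), fingerprint(MSG_B)
print("MD5 equal:        ", fp_a[0] == fp_b[0])
print("fingerprint equal:", fp_a == fp_b)
```

The MD5 halves agree for this pair, but the combined fingerprints differ because the SHA-256 halves do not.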

嗳卜坏 2024-08-18 19:46:44

I realize this is old, but thought I would contribute my solution. There are 2^128 possible hash values, and thus, by the birthday paradox, a collision becomes likely at around 2^64 items. While the solution below won't eliminate the possibility of collisions, it surely will reduce the risk by a very substantial amount.

2^64 = 18,446,744,073,709,551,616 possible combinations

What I have done is put a few hashes together, based on the input string, to get a much longer resulting string that you treat as your hash...

So my pseudo-code for this is:

Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))

That gets you to a practical improbability of a collision. But if you want to be super paranoid and can't have it happen, and storage space is not an issue (nor are computing cycles)...

Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
         & Hash(Reverse(SpellOutLengthWithWords(Length(string))))
         & Hash(Rotate13(string)) & Hash(Hash(string)) & Hash(Reverse(Hash(string)))

Okay, not the cleanest solution, but this gives you a lot more room in how infrequently you will run into a collision, to the point that I might assume impossibility in all realistic senses of the term.

For my purposes, I think the possibility of a collision is infrequent enough that I will consider this not "surefire", but so unlikely to happen that it suits the need.

Now the number of possible combinations goes up significantly. While you could spend a long time working out how many combinations this could get you, I will say that in theory it lands you SIGNIFICANTLY more than the number quoted above of

2^64 (or 18,446,744,073,709,551,616)

likely by a hundred more digits or so. The theoretical max this could give you would be

Possible number of resulting strings:

528294531135665246352339784916516606518847326036121522127960709026673902556724859474417255887657187894674394993257128678882347559502685537250538978462939576908386683999005084168731517676426441053024232908211188404148028292751561738838396898767036476489538580897737998336
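The three-part pseudo-code in this answer could be written in Python roughly as follows. This is a sketch under assumptions: `Hash` is taken to be MD5 over the UTF-8 bytes, `&` is read as string concatenation, and the names `md5_hex` and `combined_hash` are invented here rather than taken from the answer.

```python
import hashlib


def md5_hex(text: str) -> str:
    """MD5 of the UTF-8 encoding of text, as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def combined_hash(s: str) -> str:
    """Hash(string) & Hash(Reverse(string)) & Hash(Length(string))."""
    return md5_hex(s) + md5_hex(s[::-1]) + md5_hex(str(len(s)))


print(combined_hash("mdsAnalysisResult105588"))
```

Note that the third component depends only on the length, so it is identical for all inputs of the same length. Likewise, in the longer variant, Hash(Hash(string)) cannot tell apart two inputs whose Hash(string) values already collide, so the extra components are not all independent.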

回忆凄美了谁 2024-08-18 19:46:44

It looks like understanding the theory doesn't help when it comes to applying it in practice. You need a feel for what the numbers mean: 10 is ten ones (1111111111), and 100 is ten times that.

For all of the hashes to be used on one file system (or in one "birthday system"), every person in the world would need to have 18446744073709551616/8000000000 = 2305843009.21 files. If each file is 1 MB in size, that is 2305843009 MB, or 2305843 GB, or 2305 TB, or 153722 free 15 GB Google Drive accounts, for each person.

If we make the files bigger, more space is used and the file count goes down, which means fewer hashes. And in practice files will not get smaller, only bigger.

Someone could calculate how big the files would need to be for all of the MD5 hashes to be filled.

If the average file size was 3.22 MB in 2002 and 8.92 MB in 2005, we can assume file sizes are still of that order. So even Google's file system would never hold that many files on one system: if each of the 8 billion people in the world filled a free 15 GB Google Drive with fairly small 3 MB files, that would make 40000000000000 files, which is about 0.00021684% of the 18446744073709551616 figure above.

Compare this with unrelated things like birthdays: 2 people in a year of 100 days cover 2 days, or 0.02 of the year, and 2 people in a year of 365 days cover 0.00547 of it.
Two MD5-hashed files cover 2/18446744073709551616, or about 1.08 × 10^-19, of all files, if that many existed at all.
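The arithmetic in this answer can be reproduced with a few lines of Python. This is only a check of the numbers as stated, using the answer's own assumptions (8 billion people, 1 MB files, decimal units, 15 GB per Drive account, 3 MB average file size) and its 2^64 figure; the variable names are made up.

```python
HASH_COUNT = 2 ** 64              # the 18,446,744,073,709,551,616 figure used in the answer
WORLD_POPULATION = 8_000_000_000  # assumed by the answer

# Files each person would need so that every one of those hashes is used.
files_per_person = HASH_COUNT / WORLD_POPULATION
storage_gb = files_per_person / 1000          # 1 MB per file, 1000 MB per GB
drives_per_person = storage_gb / 15           # 15 GB per free Drive account

# Everyone fills one 15 GB Drive with 3 MB files.
files_in_full_drives = WORLD_POPULATION * (15_000 // 3)
fraction_used = files_in_full_drives / HASH_COUNT

print(f"files per person:  {files_per_person:,.2f}")
print(f"storage per person: {storage_gb:,.0f} GB = {drives_per_person:,.0f} drives")
print(f"files in full drives: {files_in_full_drives:,}")
print(f"fraction of 2^64: {fraction_used:.4e} ({fraction_used * 100:.8f}%)")
```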

It is like asking, in the world of Adam and Eve, whether they share the same hash birthday, when there are not even 365 people in the world, or that many files in the file system, or that many passwords at all.

So the number of attempts needed to hit a collision when trying to hack is so large that it is impossible against a real-life secured server.

If the full limit of MD5 is 18,446,744,073,709,551,616, then you will never have that many files in the whole world.

MD5 would only run out if all of the world's strings were counted into hashes, and that many will never exist. So it is just a problem of MD5 being short. But will we ever have trillions of strings of huge length that really have the same hash?

Actually, it would be like comparing babies born on 365 different days with a 366th baby to find out which birthday is the same.

As you see, all the answers say yes in theory, but fail to prove it with real-life examples. If it is a password, then only a very long string might hash the same as a short one.

If it is hashing for file identification, then use a different hash or a combination of them.

The birthday problem here is as if one person were the 4-letter word "abcd", while another person's DNA could be the same only if it were "abcdfijdfj".

If you read Wikipedia about the birthday problem, it is not only like a birthday date, but like a birth date, hour, second, millisecond and more, like a DNA problem.

With a hash, can you have the same DNA and birthday, as with twins? Nope. With someone else, maybe sometimes.

The birthday paradox is a probability result for 365 options (days), while a hash has how many options? Many more. So if you have 2 different strings that match, it is just because the MD5 hash is too short for too many files, so use something longer than MD5.

It is not comparing 50 babies across 365 days. It is comparing whether 2 hashes are the same when strings of different lengths are hashed, like abcd matching the 25-letter abcdef...zdgdege and the 150-letter sadiasdjsadijfsdf.sdaidjsad.dfijsdf.

So if it is a password, then its birthday sibling will be much longer and won't even exist, since no one creates a 25-letter password.

For file comparison, I'm not sure how big the probability is, but it is not 97% and not even 0.0000001%.

OK, let's be more specific.

If it is files, then a collision can occur in a huge system, since the files will all be different, but it should not be a problem in practice, since 5 quadrillion (5 000 000 000 000 000) files would have to be on the same system, for UUIDs and for MD5 alike.

And if it is a password, then it takes 10 years trying once every second. You could try every millisecond, but then blocking the IP for 1 minute after 3 wrong guesses would make guessing take millions of years.

When I see something wrong, then I know it's wrong. Theory promises vs. reality.
