How likely is an md5 checksum false positive?
I have a client who is distributing large binary files internally. They are also passing md5 checksums of the files and apparently verifying the files against the checksum before use as part of their workflow.
However, they claim that they "often" encounter corruption in files where the md5 still says the file is good.
Everything I've read suggests that this should be hugely unlikely.
Does this sound likely? Would another hashing algorithm provide better results? Should I actually be looking at process problems, such as their claiming to check the checksum but not really doing it?
NB, I don't yet know what "often" means in this context. They are processing hundreds of files a day. I don't know if this is a daily, monthly or yearly occurrence.
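For context, here is roughly the verification step I would expect on their side; a minimal Python sketch (the file name and checksum are made up for illustration):

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a file, reading in chunks so large binaries never sit fully in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical file and published checksum, for illustration only.
expected = "9e107d9d372bb6826bd81d3542a419d6"
actual = md5_of_file("payload.bin")
if actual != expected:
    raise SystemExit(f"checksum mismatch: expected {expected}, got {actual}")
```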
6 Answers
MD5 is a 128-bit cryptographic hash function, so different messages should be distributed fairly evenly over the 128-bit space. That means two files (excluding files specifically built to defeat MD5) should have a 1 in 2^128 chance of colliding. In other words, if a pair of files had been compared every nanosecond since the universe began, you still wouldn't expect it to have happened yet.
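To put that nanosecond figure in numbers, here's a rough sketch (plain arithmetic, nothing assumed beyond the 2^128 space):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~3.16e7 seconds

pairs = 2**128          # roughly the number of pairs you'd need to try for a 1-in-2^128 event
seconds = pairs / 1e9   # at one pair compared per nanosecond
years = seconds / SECONDS_PER_YEAR
print(f"{years:.2e} years")  # ~1.08e22 years; the universe is ~1.4e10 years old
```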
If a file is corrupted, the probability that the corrupted file has the same md5 checksum as the uncorrupted file is 1 in 2^128. In other words, it will happen almost as "often" as never. It is astronomically more likely that your client is misreporting what really happened (for example, that they are computing the wrong hash).
Sounds like a bug in their use of MD5 (maybe they are MD5-ing the wrong files), or a bug in the library they're using. For example, an older MD5 program I once used didn't handle files over 2GB.
This question suggests that, on average, you would get a collision every 100 years if you were generating 6 billion files per second, so it's quite unlikely.
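One cheap way to rule out a buggy tool is to cross-check two independent implementations on the same large file. A sketch, assuming GNU md5sum is on the PATH and `payload.bin` is a hypothetical file:

```python
import hashlib
import subprocess

def md5_hashlib(path: str) -> str:
    # Chunked read, so files over 2GB are handled without loading them whole.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def md5_system(path: str) -> str:
    # GNU coreutils md5sum prints "<hex>  <path>"; take the first field.
    out = subprocess.run(["md5sum", path], capture_output=True, text=True, check=True)
    return out.stdout.split()[0]

path = "payload.bin"
a, b = md5_hashlib(path), md5_system(path)
print("tools agree" if a == b else f"DISAGREE: {a} vs {b} -- suspect the tooling")
```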
No, the chance of a random corruption causing the same checksum is 1 in 2^128, or about 1 in 3.40 × 10^38. This number puts a one-in-a-billion (10^9) chance to shame.
Probably not. While MD5's collision resistance has been broken against deliberate attack, it holds up fine against random corruption, and it is a popular standard to use.
Probably, but consider all the possible points of failure:
If it is the last one, then one final thought is to distribute the files in a wrapper format that forces the operator to unwrap the file, where the unwrapping performs verification during extraction. I'm thinking of something like Gzip or 7-Zip, which support large files and can possibly have compression turned off (I don't know whether they support that).
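As a sketch of that wrapper idea, Python's gzip module fits: the gzip format carries a CRC-32 in its trailer and checks it during decompression, and compresslevel=0 approximates turning compression off (file names here are hypothetical):

```python
import gzip
import shutil

# Wrap: gzip stores a CRC-32 and length in its trailer; level 0 keeps it near pass-through.
with open("payload.bin", "rb") as src, gzip.open("payload.bin.gz", "wb", compresslevel=0) as dst:
    shutil.copyfileobj(src, dst)

# Unwrap: decompression recomputes the CRC-32, so corruption fails loudly instead of silently.
try:
    with gzip.open("payload.bin.gz", "rb") as src, open("payload_out.bin", "wb") as dst:
        shutil.copyfileobj(src, dst)
except gzip.BadGzipFile as e:  # raised on corrupt data, including CRC failure (Python 3.8+)
    raise SystemExit(f"file failed verification during extraction: {e}")
```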
There are all sorts of reasons that binaries either won't get distributed or, when they are, arrive corrupted (firewalls, size limitations, virus insertion, etc.). You should always encrypt binary files when sending them (even low-grade encryption is better than none) to help protect data integrity.
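If you go the encryption route, note that only an authenticated scheme actually detects tampering. A sketch using the third-party cryptography package's Fernet recipe (AES-CBC plus an HMAC tag), assuming it's installed:

```python
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()  # would be shared with the recipient out-of-band
f = Fernet(key)

token = f.encrypt(b"large binary payload...")  # ciphertext plus integrity tag

tampered = token[:-1] + bytes([token[-1] ^ 1])  # simulate one bit flipped in transit
try:
    f.decrypt(tampered)
except InvalidToken:
    print("corruption/tampering detected before the file is ever used")
```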
Couldn't resist a back-of-envelope calculation:
There are 2^128 possible MD5 hashes, or c. 3.4 x 10^38 (that is, odds of about 340 billion billion billion billion to 1 against). Let's call this number 'M'.
The probability that the Kth hash is distinct from all earlier ones, given that the 1st to (K-1)th were all distinct, is (1 - (K-1)/M), as K-1 of the M possible values are already taken.
So P(no duplicate in N file hashes) = Product[k = 1...N] (1 - (k-1)/M). When N^2 <<< M, this approximates to 1 - (1/2)N^2/M, giving P(one or more duplicates) ≈ (1/2)N^2/M, where (1/2)N^2 approximates the number of pairwise comparisons of hashes that have to be made.
So let's say we take a photograph of EVERYONE ON THE PLANET (7.8 billion people, a little under 2^33); then there are about 30.4 billion billion pairwise comparisons to make (a little under 2^65).
This means that the chance of a matching MD5 hash (assuming a perfectly even distribution) is still 2^65/2^128 = 2^-63, or about 1 in 10,000,000,000,000,000,000 (roughly 10^19).
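The same arithmetic, checked mechanically in a few lines of Python (numbers as above):

```python
from math import log2

M = 2**128               # size of the MD5 output space
N = 7_800_000_000        # one file per person on the planet
pairs = N * (N - 1) // 2  # ~N^2/2 pairwise comparisons
p_dup = pairs / M         # birthday approximation: P(>=1 duplicate) ~ N^2 / (2M)

print(f"pairs ≈ 2^{log2(pairs):.1f}")                       # a little under 2^65
print(f"P(collision) ≈ {p_dup:.2e} (1 in {1 / p_dup:.2e})")  # ~1 in 10^19
```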
MD5 is a pretty decent hash function for non-hostile environments, which means the chance of your client seeing a false match is far lower than, say, the chance of their CEO going crazy and burning down the data centre, let alone the things they actually worry about.