Is MD5 still good enough to uniquely identify a file?
Is MD5 hashing a file still considered a good enough method to uniquely identify it, given the known breaks in the MD5 algorithm and its security issues? Security is not my primary concern here, but uniquely identifying each file is.
Any thoughts?
9 Answers
Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.
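To put "vanishingly small" in numbers, the birthday bound gives a rough estimate of the accidental-collision probability for n uniformly random 128-bit hashes (a sketch; the approximation n(n-1)/2^129 is only valid while the result is small):

```python
def collision_probability(n, bits=128):
    """Birthday-bound approximation: probability that at least two of n
    uniformly random `bits`-bit hashes collide (valid while the result
    is much smaller than 1)."""
    return n * (n - 1) / 2 ** (bits + 1)

# Even a billion files leaves the accidental-collision odds negligible:
p = collision_probability(10**9)  # roughly 1.5e-21
```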
For practical purposes, the hash created might be suitably random, but theoretically there is always a probability of a collision, due to the Pigeonhole principle. Having different hashes certainly means that the files are different, but getting the same hash doesn't necessarily mean that the files are identical.
Using a hash function for that purpose - no matter whether security is a concern or not - should therefore always only be the first step of a check, especially if the hash algorithm is known to easily create collisions. To reliably find out if two files with the same hash are different you would have to compare those files byte-by-byte.
MD5 will be good enough if you have no adversary. However, someone can (purposely) create two distinct files which hash to the same value (that's called a collision), and this may or may not be a problem, depending on your exact situation.
Since knowing whether known MD5 weaknesses apply to a given context is a subtle matter, it is recommended not to use MD5. Using a collision-resistant hash function (SHA-256 or SHA-512) is the safe answer. Also, using MD5 is bad public relations (if you use MD5, be prepared to have to justify yourselves; whereas nobody will question your using SHA-256).
An MD5 hash can produce collisions. Theoretically, although it is highly unlikely, many different files could produce the same hash. Don't trust to luck: check for MD5 collisions before storing the value.
I personally like to create md5 of random strings, which reduces the overhead of hashing large files. When collisions are found, I iterate and re-hash with the appended loop counter.
You may read on the pigeonhole principle.
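The iterate-and-re-hash idea above might look like the following sketch; `taken` stands in for whatever store (a set, a database index) holds the keys already in use:

```python
import hashlib

def unique_key(value, taken, max_tries=1000):
    """Derive an MD5-based key for `value`; on a collision with an existing
    entry in `taken` (a hypothetical set of keys already stored), append a
    loop counter to the input and re-hash until a free key is found."""
    key = hashlib.md5(value.encode()).hexdigest()
    counter = 0
    while key in taken and counter < max_tries:
        counter += 1
        key = hashlib.md5(f"{value}{counter}".encode()).hexdigest()
    return key
```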
I wouldn't recommend it. If the application will run on a multi-user system, there might be a user who has two files with the same MD5 hash (he might be an engineer who plays with such files, or just curious; they are easily downloadable from http://www2.mat.dtu.dk/people/S.Thomsen/wangmd5/samples.html , and I downloaded two samples myself while writing this answer). Another thing is that some applications might store such duplicates for whatever reason (I'm not sure whether any such application exists, but the possibility does).
If you are uniquely identifying files generated by your program I would say it is ok to use MD5. Otherwise, I would recommend any other hash function where no collisions are known yet.
Personally, I think people reach for raw checksums (pick your method) of other objects far too often when what they really want is a unique identifier. Fingerprinting an object was not intended for this use, and it is likely to require more thought than using a UUID or a similar integrity mechanism.
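The distinction can be made concrete: mint a UUID once for identity, and keep the checksum separately for integrity. A sketch (not a prescribed schema; the variable names are illustrative):

```python
import hashlib
import uuid

contents = b"example file contents"

# Identity: a random UUID minted once when the record is created;
# it never changes, even if the file's bytes later do.
file_id = str(uuid.uuid4())

# Integrity: a checksum of the bytes, recomputed later to detect changes.
checksum = hashlib.md5(contents).hexdigest()
```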
MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).
When hashing short (less than a few KB?) strings or files, one can create two MD5 hash keys: one for the actual string, and a second for the reverse of the string concatenated with a short asymmetric string. Example: md5(reverse(string || '1010')). Adding the extra string ensures that even files consisting of a series of identical bits generate two different keys. Please understand that even under this scheme there is a theoretical chance of the two hash keys being identical for non-identical strings, but the probability seems exceedingly small (on the order of the square of the single MD5 collision probability), and the time saving can be considerable as the number of files grows. More elaborate schemes for creating the second string could be considered as well, but I am not sure whether these would substantially improve the odds.
To check for collisions, one can test whether any MD5 hash key maps to more than one distinct bit_vector in the database (assuming a table named db with a bit_vector column):

select md5(bit_vector), count(distinct bit_vector)
from db
group by md5(bit_vector)
having count(distinct bit_vector) > 1

Any row returned is a collision: two different bit_vectors sharing one hash key.
I like to think of MD5 as an indicator of probability when storing a large amount of file data.
If the hashes are equal, I know I have to compare the files byte by byte, but that should happen only rarely for a spurious reason; otherwise (the hashes are not equal) I can be certain we are talking about two different files.