Can I use MD5 to prevent duplicate content?
I would like to prevent duplicate content. I do not want to keep copies of the content, so I decided to keep just the MD5 signatures.
I read that MD5 collisions do happen: different content could produce the same MD5 signature.
Do you think MD5 is enough?
Should I use MD5 and SHA-1 together?
Comments (8)
People have been able to deliberately produce MD5 collisions under contrived circumstances, but for preventing duplicate content (in the absence of malicious users) it's more than adequate.
Having said that, if you can use SHA-1 (or SHA-2) you should - you'll be fractionally but measurably safer from collisions.
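A minimal sketch of the digest-only approach described in the question, using SHA-256 (a SHA-2 variant) as suggested above. `DigestIndex` and `add_if_new` are hypothetical names chosen for illustration, and the in-memory set is an assumption standing in for whatever storage is actually used.

```python
import hashlib


class DigestIndex:
    """Remembers only the digests of content seen so far, never the content.

    Hypothetical helper for illustration; not taken from any library.
    """

    def __init__(self, algorithm="sha256"):
        self.algorithm = algorithm
        self.seen = set()  # set of raw digest bytes

    def add_if_new(self, content: bytes) -> bool:
        """Record the content's digest; return False if it was already seen."""
        digest = hashlib.new(self.algorithm, content).digest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True


index = DigestIndex()
first = index.add_if_new(b"hello world")    # new content
second = index.add_if_new(b"hello world")   # same bytes again
third = index.add_if_new(b"hello, world")   # different bytes
```

Passing `"md5"` or `"sha1"` as the algorithm name switches the hash without changing anything else, so the choice of hash stays a one-line decision.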
MD5 should be fine. Collisions are very rare, but if you're really worried, you can use SHA-1 as well.
Though I guess the signatures really aren't that large, so if you have the spare processing cycles and the disk space, you could do both. But if space or speed is limited, I'd just go with one.
Why not simply compare the content byte for byte when two hashes match? Hash collisions are very rare, so you will only rarely need the byte-for-byte check. That way, duplicates will only be detected if the items are actually identical.
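The idea above could be sketched roughly as follows. Note the assumption: this only works if the original content is still available for comparison, which the question wanted to avoid. `VerifiedStore` is a hypothetical name, and keeping everything in a dictionary in memory is purely for illustration.

```python
import hashlib
from collections import defaultdict


class VerifiedStore:
    """Uses MD5 as a fast first filter, then confirms with a byte comparison.

    Hypothetical helper; it keeps the content itself, unlike a digest-only index.
    """

    def __init__(self):
        # digest -> list of distinct contents that share that digest
        self.by_digest = defaultdict(list)

    def add(self, content: bytes) -> bool:
        """Store the content; return False if an identical item already exists."""
        digest = hashlib.md5(content).digest()
        candidates = self.by_digest[digest]
        # Only items with the same digest need the expensive full comparison.
        if any(existing == content for existing in candidates):
            return False
        candidates.append(content)
        return True


store = VerifiedStore()
added_first = store.add(b"some content")
added_again = store.add(b"some content")
added_other = store.add(b"other content")
```

With this layout a true hash collision simply results in two entries under the same digest, so distinct content is never rejected by mistake.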
md5 should be enough. Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content.
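To put a rough number on "incredibly small", here is a sketch using the standard birthday-bound approximation, p ≈ 1 − exp(−n(n−1) / 2^(b+1)), for n items and a b-bit hash. It assumes the hash behaves like a uniformly random function, which holds for accidental collisions but not against someone deliberately crafting them. The function name is made up for this example.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one accidental collision among n_items.

    Birthday-bound approximation; assumes uniformly distributed hash outputs.
    """
    exponent = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


# Ten billion items under a 128-bit hash such as MD5.
p_md5 = collision_probability(10**10, 128)

# About four billion items under a 64-bit hash, for comparison.
p_64bit = collision_probability(2**32, 64)
```

Under these assumptions, ten billion items hashed to 128 bits give a collision probability on the order of 10^-19, while a 64-bit hash already reaches roughly 39% at about four billion items.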
If you're really afraid of accidental collisions just do both MD5 and SHA1 hashes and compare them. If they both match, it's the same content. If either one differs, it's different content.
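One way to sketch this is to store both digests as a single composite key, so two items count as the same only when both hashes agree. `combined_key` is a hypothetical helper name.

```python
import hashlib


def combined_key(content: bytes) -> tuple:
    """Return (MD5 digest, SHA-1 digest); keys are equal only if both match."""
    return (hashlib.md5(content).digest(), hashlib.sha1(content).digest())


seen = set()


def is_new(content: bytes) -> bool:
    """Record the composite key; return False if it was already present."""
    key = combined_key(content)
    if key in seen:
        return False
    seen.add(key)
    return True


result_a = is_new(b"document one")
result_b = is_new(b"document one")
result_c = is_new(b"document two")
```

Each key costs 36 bytes (16 for MD5 plus 20 for SHA-1), compared with 32 bytes for a single SHA-256 digest.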
Combining algorithms serves only to obfuscate; it does not increase the security of a hashing algorithm.
MD5 is too broken to use anyway, IMHO. Researchers have demonstrated forging content that produces an MD5 collision, which opened the door to generating a forged CSR and buying a cert from RapidSSL for a domain name they did not own. Security Now! episode 179 explains the process.
For me, SHA-based hashes are stronger, and most development platforms support them, so the choice is easy. The remaining deciding factor is then the block size.
A timestamp + md5 together are safe enough.
MD5 is broken and SHA1 is close to it. Use SHA2.
Edit:
Based on an update from the OP, it doesn't seem that intentional collisions are a serious concern here. For unintentional ones, any decent hash with at least a 64-bit output would be fine.
I would still avoid MD5 and even SHA1, in general, but there's no reason to be dogmatic about it. If the tool fits here, then by all means use it.
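For the unintentional-collision case, the "at least 64-bit output" suggestion above might look like the sketch below. It uses BLAKE2b from Python's `hashlib` with an 8-byte digest as one possible choice of hash; that choice and the name `short_key` are assumptions made for this example, not something specified in the answer.

```python
import hashlib


def short_key(content: bytes) -> bytes:
    """Return a 64-bit (8-byte) BLAKE2b digest of the content."""
    return hashlib.blake2b(content, digest_size=8).digest()


key_a = short_key(b"first item")
key_a_again = short_key(b"first item")
key_b = short_key(b"second item")
```

A 64-bit key halves the storage of an MD5 digest, but accidental collisions become plausible once the collection approaches a few billion items, so this only suits smaller collections.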