How much data (how many MB) can MD5 uniquely identify?
I've got millions of data records that are each about 2MB in size. Each of these records is stored in a file, and there is a set of other data associated with that record (stored in a database).
When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the memory, then using this hash as a key into the database. The key will help me locate the other data.
What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2MB piece of data. In other words, can I use an MD5 hash without worrying too much about collisions?
I realize there is a chance of collision. My concern is how likely a collision is across millions of 2MB data records. Is a collision a likely occurrence? How does it compare to hard disk failure or other computer failures? How much data can MD5 safely identify? What about millions of GB-sized files?
I'm not worried about malice or data tampering. I've got protections in place so that I won't be receiving manipulated data.
Comments (1)
This boils down to the so-called birthday paradox. That Wikipedia page has simplified formulas for evaluating the collision probability. For millions of records and a 128-bit hash, it will be a very small number.
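As a rough illustration, the usual birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) can be evaluated directly. This is a sketch that assumes MD5 outputs behave like uniformly random 128-bit values (reasonable only for non-adversarial data, as in the question); the function name `collision_probability` is made up for this example:

```python
import math


def collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Approximate probability of at least one collision among n_items
    hash values drawn uniformly from a space of 2**hash_bits values."""
    space = 2.0 ** hash_bits
    # Birthday approximation: p = 1 - exp(-n(n-1) / (2N)).
    # expm1 avoids losing the result to rounding when p is tiny.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


# Ten million records hashed with a 128-bit digest such as MD5.
p = collision_probability(10_000_000)
print(f"{p:.3e}")
```

For ten million records the result is on the order of 10^-25. Note that only the number of records enters the formula; the size of each record (2MB or several GB) does not affect the estimate.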
The next question is how you deal with, say, a 10^-12 collision probability: see this very similar question.