使用 URI 的 md5 哈希作为数据库中的主键的优点和缺点
我正在构建一个数据库,该数据库将存储一系列对象(例如科学论文、标本、DNA 序列等)的信息,这些对象都在线存在,并且可以通过 URL 或标识符(例如 < a href="http://www.doi.org/" rel="noreferrer" title="DOI">DOI。 使用这些 GUID 作为对象的主键似乎是一个合理的想法,我遵循了 delicious和 Connotea 使用 GUID 的 md5 哈希值。 如果您将鼠标悬停在美味或 Connotea 书签中的编辑或删除按钮上,您将在浏览器状态栏中看到 md5 哈希值。 例如,http://stackoverflow/ 的书签,
http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d
其中 e4a42d992025b928a586b8bdc36ad38d a 是 http://stackoverflow/。
有人对这种方法的利弊有什么看法吗?
对我来说,这种方法的一个优点(相对于使用数据库本身生成的自动递增主键)是我必须在对象之间进行大量链接,并且通过使用 md5 哈希,我可以将这些链接存储在外部文件中(例如,作为数据挖掘/抓取的结果),然后将它们批量导入数据库。 同样,如果必须从头开始重建数据库,则对象的 URL 不会更改,因为它们使用 md5 哈希。
我欢迎任何关于这听起来是否合理,或者是否有其他(更好?)方法来做到这一点的想法。
I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is
http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d
where e4a42d992025b928a586b8bdc36ad38d ais the md5 hash of http://stackoverflow/.
Does anybody have views on the pros and cons of this approach?
For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.
I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
如果输入不是故意恶意的,那就没问题。 如果他们可能是恶意的,那就要看情况了。
在所有实际场景中,MD5 的意外冲突都是不可能的(要获得 50% 的冲突机会,您必须每秒散列 6十亿 个 URL >,每秒,100 年)。
由于未检测到的硬件故障而导致数据混乱的可能性比因实际碰撞而导致数据混乱的可能性要高出数万亿倍。
但是,如果输入可能是恶意的,则取决于您要防御的内容。
您需要故意与另一个预先存在的 URL 的哈希值发生冲突的冲突类型称为原像攻击。 没有已知的针对 MD5 的原像攻击。 截至 2017 年,还没有任何研究接近可行性,因此,即使资金雄厚的攻击者也无法计算出可哈希为数据库中任何现有 URL 的哈希值的 URL。
唯一已知的针对 MD5 的碰撞攻击是通过生成一对
二进制 blob仅相互碰撞来实现的。编辑:2024 年已经证明了 ASCII 冲突。
It's OK if the inputs aren't deliberately malicious. If they could be malicious, it depends.
Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).
It's such an improbable chance that you're trillion times more likely to get your data messed up due to an undetected hardware failure than due to an actual collision.
However, if the inputs can be malicious, it depends what you're defending against.
The type of collision you'd need to intentionally collide with a hash of another pre-existing URL is called a pre-image attack. There are no known pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined well-funded attacker can't compute a URL that would hash to a hash of any existing URL in your database.
The only known collision attack against MD5 works by generating a pair of
binary blobsthat collide only with each other.edit: In 2024 an ASCII collision has been demonstrated.
在浏览 stackoverfow 多一点后,我发现了一个早期的问题 GUID 的优点和缺点/UUID 数据库密钥 涵盖了这方面的大部分内容。
After browsing stackoverfow a little more I found an earlier question Advantages and disadvantages of GUID / UUID database keys which covers much of this ground.
多个字符串可以产生相同的 md5 哈希值。 主键必须是唯一的。 所以用hash作为主键是不好的。 更好的方法是直接使用 GUID。
是适合在 URL 中使用的 GUID。 当然。 这是我使用 Java 创建的 GUID(实际上是 UUID): 1ccb9467-e326-4fed-b9a7-7edcba52be84
url 可能是:
它有点长,但完全可用,并且实现了您所描述的目标。
Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.
Is a GUID suitable for use in a URL. Sure. Here's a GUID (actually, a UUID) I jsut created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84
The url could be:
It's longish but perfectly usable and achieves what you describe.
也许您想阅读以下文档:
http:// www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf
Maybe this document is something you want to read:
http://www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf
通常许多不同的 URL 都指向同一页面。
http://example.com/
example.com
http://www.example.com/
http://example.com/index.html
http://example.com/。
https://example.com/
等等。
这对您来说可能是问题,也可能不是问题。
Often lots of different urls point to the same page.
http://example.com/
example.com
http://www.example.com/
http://example.com/index.html
http://example.com/.
https://example.com/
etc.
This might or might not be a problem for you.
MD5 被认为已被弃用 - 至少出于加密目的,但我建议仅使用 md5 来向后兼容现有内容。 当我们确实有其他尚未(至少还)被破坏的哈希算法时,您应该有充分的理由选择 md5。
我在该方法中看到的问题:
(如阿伦德提到的)
后者可能是重要的 - 这可以像删除和添加一样简单地完成。 也就是说,如果这些 id 在数据库外部永远不可见/不可存储。 (就像作为 URL 的组成部分一样。)
我想这些对于 DOI 来说不会是问题。
它如何与非自动编号整数 ID 设置一起工作,但脱机插入器代理在哪里创建数字? (也许可以使用专用的数字范围?)
如果两个用户独立添加相同的网址,可能会出现重复问题吗?
MD5 is considered deprecated - at least for cryptographic purposes, but I would suggest only using md5 for backwards compatibility with existing stuff. You should have a good reason to go with md5 when we do have other hash algos out there that aren't (at least yet) broken.
Problems I see with the approach:
(As arend mentioned)
The latter being the one that might be important - this could be done as simply as a remove and an add. That is, if these ids are never visible/storable outside the database. (Like as a component of a URL.)
I guess these won't be a problem for DOIs.
How would it work with a non-autonumber integer id setup, but where the offline inserter agent creates the numbers? (Can use a dedicated range of numbers, maybe?)
Might have a problem with duplication should two users independently add the same url?
md5 哈希几乎是唯一的,但不是完全唯一的,因此不要将其用作主键。 它因加密用途而折旧。 键冲突的可能性较小,但如果您有包含数十亿行的相当大的数据库,则仍然存在一定的冲突可能性。 如果您坚持使用哈希作为主键,请使用其他更好的哈希。 您不能对主键使用非唯一值。
如果您有很大的桌子,请不要使用它。 如果你的桌子很小,你可以使用它,但不推荐。
md5 hash is almost unique, but is not totally unique unique so don't use it as primary key. It is depreciated for cryptographic use. There is less chance of key collision, but if you have pretty big database with billions of rows, there is still some chance of collision. If you insist using hash as primary key use other better hash. You cannot use non unique values for Primary Key.
If you have pretty big table, don't use it. If you have small table, you might use it, but not recommended.