存储网站地址和页面 URL 的最佳主键是什么?
为了避免使用自动增量 id(它并不真正与数据绑定),我使用 URL 的 SHA1 签名作为主键来设计架构。
这种方法在很多方面都很有用:例如,我不需要从数据库中读取last_id,因此我可以准备所有表更新来计算键并在单个事务中进行真正的更新。没有违反约束。
无论如何,我读了两本书,它们告诉我我错了。在《高性能MySQL》中,据说随机密钥对数据库优化器不利。此外,在 Joe Celko 的每一本书中,他都说主键应该是数据的某些部分。
问题是:URL 的自然键是……URL 本身。事实是,如果站点很短 (www.something.com),则对 URL 没有强制限制(请参阅 http://www.boutell.com/newfaq/misc/urllength.html)。
考虑一下我必须存储(并使用)数百万个。
那么哪一个是最好的钥匙呢?自动增量 ID、URL、URL 哈希?
which is the best primary key to store website address and page URLs?
To avoid the use of autoincremental id (which is not really tied to the data), I designed the schema with the use of a SHA1 signature of the URL as primary key.
This approach is useful in many ways: for example I don't need to read the last_id from the database so I can prepare all table updates calculating the key and do the real update in a single transaction. No constraint violation.
Anyway I read two books which tell me I am wrong. In "High performance MySQL" it is said that the random key is not good for the DB optimizer. Moreover, in each Joe Celko's books he says the primary key should be some part of the data.
The question is: the natural keys for URLs are... URLs themselves. The fact is that if for a site it is short (www.something.com), there's not an imposed limit for am URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?
发布评论
评论(3)
您需要一个自动增量数字主键。当您需要传递 id 或连接其他表(例如,URL 的可选属性)时,您需要一些小的数字。
至于您想要哪些其他列和索引,一如既往,这取决于您将如何使用它们。
对于几乎所有使用大量 URL 的应用程序来说,存储每个 URL 的哈希值的列都是一个极好的主意。它使得通过全文选择 URL 的速度与它所能获得的速度一样快。第二个优点是,如果将该列设置为唯一,则无需担心存储实际 URL 的列唯一,并且可以使用 REPLACE INTO 和 INSERT IGNORE 作为简单、快速的原子写入操作。
我想补充一点,使用 MySQL 的内置 MD5() 函数就可以满足此目的。它唯一的缺点是专门的攻击者可以强制碰撞,我确信你不关心这一点。例如,使用内置函数可以使某些类型的连接变得更加容易。通过网络传递完整的 URL 可能会慢一点(“SELECT url FROM urls WHERE hash=MD5('verylongurl')”而不是“WHERE hash='32charhexstring'”),但您可以选择如果你愿意的话就可以这样做。除非您能想出 MD5() 会让您失望的具体场景,否则请随意使用它。
棘手的问题是您是否以及如何需要以全文以外的方式查找 URL:例如,您是否希望在任何“bar.com”主机上查找以“/foo”开头的所有 URL?虽然“LIKE '%bar.com%/foo%'”在测试中可以工作,但大规模时它会惨败。如果您的需求包括类似的内容,您可以想出创造性的方法来生成针对您需要的数据类型的非唯一索引...对于初学者来说,可能是一个domain_name 列。几乎可以肯定的是,您必须从应用程序中填充这些列(触发器和存储过程比它们在这里的价值要麻烦得多,特别是如果您关心性能的话——不必费心)。
好消息是关系数据库对于此类事情非常灵活。您始终可以添加新列并稍后填充它们。我建议初学者:int unsigned auto_increment 主键、唯一的哈希字符(32) 和(假设 64K 字符就足够了)文本 url。
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
想必您正在谈论整个 URL,而不仅仅是主机名,包括 CGI 参数和其他内容。
对 URL 进行 SHA-1 哈希处理会使所有密钥变得很长,并且使解决问题变得相当模糊。我不得不在哈希上使用一次索引来掩盖一些机密数据,同时保持连接两个表的能力,而且性能很差。
有两种可能的方法。一种是天真而明显的;另一种是天真而明显的。它实际上在 mySQL 中运行良好。它具有简单性等优点,并且能够使用 URL LIKE 'whatever%' 进行高效搜索。
但是,如果您有大量 URL 集中在几个域中……例如……
等,那么您所查看的索引仅在最后一个字符上有所不同。在这种情况下,您可能会考虑以相反的字符顺序存储和索引 URL。这可能会导致更有效地访问索引。
(Oracle 表服务器产品恰好有一种内置的方式,通过所谓的反向索引来执行此操作。)
如果我是您,我会避免使用自动增量键,除非您必须连接两个以上的表
ON TABLE_A.URL = TABLE_B.URL
或具有这种含义的其他连接条件。Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one; it will actually work well in mySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle table server product happens has a built in way of doing this with a so-called reversed index.)
If I were you I would avoid an autoincrement key unless you have to join more than two tables
ON TABLE_A.URL = TABLE_B.URL
or some other join condition with that kind of meaing.取决于你如何使用该表。如果您主要使用
WHERE url=''
进行选择,那么使用一列表就可以了。如果您可以使用自动增量 id 来标识应用程序中所有位置的 URL,那么请使用自动增量 idDepends on how you use the table. If you mostly select with
WHERE url='<url>'
, then it's fine to have a one-column table. If you can use an autoincrement id to identify an URL in all places in your app, then use the autoincrement