ID column or clustered primary key / database consistency
If I had a table with the columns:
- Artist
- Album
- Song
- NumberOfListens
...is it better to put a clustered primary key on Artist, Album, and Song, or to have an auto-incrementing id column and put a unique constraint on Artist, Album, and Song?
How important is database consistency? If half of my tables have clustered primary keys and the other half an id column with unique constraints, is that bad or does it not matter? Both ways seem the same to me but I do not know what the industry standard is or which is better and why.
I would never put a primary key on columns of long text like Artist, Album, and Song. Use an auto-increment ID as the clustered PK. If you want the combination of Artist, Album, and Song to be unique, add a unique index on the three. If you want to search by Album or Song independently of Artist, you'll need an index for each, and each of those indexes pulls in the PK, so a small PK saves space in every other index. The savings are not just in disk space but also in memory cache, since more keys fit on a page.
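A minimal T-SQL sketch of that layout, assuming SQL Server. The table name, index names, and column lengths are hypothetical, since the question does not give them:

```sql
-- Hypothetical table: a small surrogate key is the clustered primary key.
CREATE TABLE dbo.SongListens
(
    SongListenID    int IDENTITY(1, 1) NOT NULL,
    Artist          varchar(100)       NOT NULL,  -- lengths are assumptions
    Album           varchar(100)       NOT NULL,
    Song            varchar(100)       NOT NULL,
    NumberOfListens int                NOT NULL,
    CONSTRAINT PK_SongListens PRIMARY KEY CLUSTERED (SongListenID)
);

-- Uniqueness of the natural key is enforced by a separate unique index.
CREATE UNIQUE NONCLUSTERED INDEX UX_SongListens_Artist_Album_Song
    ON dbo.SongListens (Artist, Album, Song);

-- Searches by Album or Song alone each need their own index.
-- Every entry in these indexes also stores the 4-byte SongListenID,
-- not the three text columns.
CREATE NONCLUSTERED INDEX IX_SongListens_Album ON dbo.SongListens (Album);
CREATE NONCLUSTERED INDEX IX_SongListens_Song  ON dbo.SongListens (Song);
```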
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario. You reference primary keys in your foreign key constraints, so those are crucial for the integrity of your database. Use them - always - period.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, unique, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but it doesn't have to be that way; you can easily pick a column that is not your primary key to be your clustering key.
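As a rough illustration of keeping the two apart, the T-SQL sketch below declares the primary key on the natural columns as a nonclustered constraint and clusters the table on a separate INT column. All object names and column lengths are assumptions made for the example:

```sql
-- Hypothetical table: logical primary key on the natural columns,
-- physical clustering key on a small, ever-increasing INT.
CREATE TABLE dbo.SongListensNaturalKey
(
    RowID           int IDENTITY(1, 1) NOT NULL,
    Artist          varchar(100)       NOT NULL,  -- lengths are assumptions
    Album           varchar(100)       NOT NULL,
    Song            varchar(100)       NOT NULL,
    NumberOfListens int                NOT NULL,
    -- NONCLUSTERED overrides the default of clustering on the primary key.
    CONSTRAINT PK_SongListensNaturalKey
        PRIMARY KEY NONCLUSTERED (Artist, Album, Song)
);

-- The clustering key is chosen independently of the primary key.
CREATE UNIQUE CLUSTERED INDEX CIX_SongListensNaturalKey_RowID
    ON dbo.SongListensNaturalKey (RowID);
```

Foreign keys in other tables can still reference the primary key, while every nonclustered index carries only RowID as its row locator.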
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT, which allows for more than 2 billion rows, should be sufficient for the vast majority of tables - and compared to a VARCHAR(20) or so as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
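A back-of-the-envelope version of that estimate, written as T-SQL. The row count and number of non-clustered indexes are made-up inputs, and the calculation assumes every VARCHAR(20) value uses all 20 bytes and ignores per-row overhead, so treat the result only as an order of magnitude:

```sql
-- All four inputs are hypothetical; change them to match your own table.
DECLARE @Rows         bigint = 10000000;  -- rows in the table
DECLARE @Indexes      int    = 4;         -- non-clustered indexes on it
DECLARE @WideKeyBytes int    = 20;        -- fully used VARCHAR(20) key
DECLARE @IntKeyBytes  int    = 4;         -- INT key

-- Extra bytes stored per index entry, times rows, times indexes.
SELECT (@WideKeyBytes - @IntKeyBytes) * @Rows * @Indexes / 1048576.0
       AS ApproxMegabytesSaved;
```

With these inputs the difference is 16 bytes x 10 million rows x 4 indexes, which comes to about 610 MB.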
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
Marc
Clustered indexes are great for range based queries. For example, a log date or order date. Putting one on Artist, Album, and Song will [probably] cause fragmentation when you insert new rows.
If your DB supports it, add a non-clustered primary key on Artist, Album, and Song and call it good. Or just add a unique key on Artist, Album, and Song.
Having an auto-incrementing primary key would only really be useful if you had to have referential integrity to another table.
Without knowing the exact requirements, in general you would probably have an artist table, and possibly an album table too. A song table would then hold a unique combination of artist id, album id, and song. I'd enforce the uniqueness with an index or a constraint, depending on the application, and use an id for the primary key.
First of all, there's already a problem here because the data is not normalized. Creating any sort of index on a bunch of text columns is something that should be avoided whenever possible. Even if these columns aren't text (and I suspect that they are), it still doesn't make sense to have artist, album and song in the same table. A much better design for this would be:
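A sketch of what such a normalized design could look like in T-SQL. The table names, column names, lengths, and constraint names below are assumptions for illustration and are not taken from the post:

```sql
-- Hypothetical schema: one table each for artists, albums, and songs.
CREATE TABLE dbo.Artists
(
    ArtistID   int IDENTITY(1, 1) NOT NULL,
    ArtistName varchar(100)       NOT NULL,
    CONSTRAINT PK_Artists PRIMARY KEY CLUSTERED (ArtistID)
);

CREATE TABLE dbo.Albums
(
    AlbumID   int IDENTITY(1, 1) NOT NULL,
    ArtistID  int                NOT NULL,
    AlbumName varchar(100)       NOT NULL,
    CONSTRAINT PK_Albums PRIMARY KEY CLUSTERED (AlbumID),
    CONSTRAINT FK_Albums_Artists
        FOREIGN KEY (ArtistID) REFERENCES dbo.Artists (ArtistID)
);

CREATE TABLE dbo.Songs
(
    SongID          int IDENTITY(1, 1) NOT NULL,
    AlbumID         int                NOT NULL,
    SongName        varchar(100)       NOT NULL,
    NumberOfListens int                NOT NULL
        CONSTRAINT DF_Songs_NumberOfListens DEFAULT (0),
    CONSTRAINT PK_Songs PRIMARY KEY CLUSTERED (SongID),
    CONSTRAINT FK_Songs_Albums
        FOREIGN KEY (AlbumID) REFERENCES dbo.Albums (AlbumID)
);
```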
Once you have this design, you have the ability to search for individual albums and artists as well as songs. You can also add covering indexes to speed up queries, and the indexes will be much smaller and therefore faster than the original design.
If you don't need to do range queries (which you probably don't), then you could replace the IDENTITY key with a ROWGUID if that suits your design better; it doesn't really matter much in this case, and I would stick with the simple IDENTITY.
You have to be careful with clustering keys. If you cluster on a key that is not even remotely sequential (and an artist, album, and song name definitely qualify as non-sequential), then you end up with page splits and other nastiness. You don't want this. And as Marc says, a copy of this key gets added to every index, and you definitely don't want this when your key is 300 or 600 bytes long.
If you want to be able to quickly query for the number of listens for a specific song by the artist, album, and song name, it's actually quite simple with the above design, you just need to index properly:
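One possible set of indexes, assuming hypothetical tables dbo.Artists (ArtistID, ArtistName), dbo.Albums (AlbumID, ArtistID, AlbumName), and dbo.Songs (SongID, AlbumID, SongName, NumberOfListens), each clustered on its INT id. The index names are likewise made up:

```sql
-- One artist per name.
CREATE UNIQUE NONCLUSTERED INDEX UX_Artists_ArtistName
    ON dbo.Artists (ArtistName);

-- One album of a given name per artist.
CREATE UNIQUE NONCLUSTERED INDEX UX_Albums_ArtistID_AlbumName
    ON dbo.Albums (ArtistID, AlbumName);

-- One song of a given name per album. INCLUDE makes the index cover
-- NumberOfListens, so a lookup does not have to touch the clustered index.
CREATE UNIQUE NONCLUSTERED INDEX UX_Songs_AlbumID_SongName
    ON dbo.Songs (AlbumID, SongName)
    INCLUDE (NumberOfListens);
```

Taken together, the three unique indexes imply that the combination of artist, album, and song name is unique.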
Now this query will be fast:
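A query of roughly this shape would fit, again assuming hypothetical tables dbo.Artists, dbo.Albums, and dbo.Songs joined on ArtistID and AlbumID, with unique indexes on ArtistName, (ArtistID, AlbumName), and (AlbumID, SongName). The literal values are placeholders:

```sql
-- Look up the listen count for one song by its three names.
SELECT s.NumberOfListens
FROM dbo.Artists AS ar
INNER JOIN dbo.Albums AS al
    ON al.ArtistID = ar.ArtistID
INNER JOIN dbo.Songs AS s
    ON s.AlbumID = al.AlbumID
WHERE ar.ArtistName = 'Some Artist'   -- placeholder values
  AND al.AlbumName  = 'Some Album'
  AND s.SongName    = 'Some Song';
```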
If you check out the execution plan you'll see 3 index seeks - it's as fast as you can get it. We've guaranteed the exact same uniqueness as in the original design and optimized for speed. More importantly, it's normalized, so both an artist and an album have their own specific identity, which makes this a great deal easier to manage over the long term. It's much easier to search for "all albums by artist X." It's much much easier and faster to search for "all songs on album Y."
When designing a database, normalization should be your first concern, indexing should be your second. And you're likely to find that once you have a normalized design, the best indexing strategy becomes kind of obvious.