Do clustered indexes have to be unique?
What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sort?
Is it "made" unique, and if so, how? What is the best way to make it unique?
I am asking because I am currently using a clustered index to divide my table into logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.
They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.
What happens if you create a CI on a non-unique column?
Does this lead to bad performance?
Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
Whether this overhead is noticeable depends on several factors.
Edit
As has been pointed out by Remus in the comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. My not having encountered one of those scenarios merely shows my own lack of exposure or competence (take your pick).
I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:
I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....
Let's start with the key things that I look for in a clustering key:
Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustered index rebuild (which because of the poor choice for the clustering key is now more likely).
Source: Ever-increasing clustering key debate - again!
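To make the "uniquify" idea from the quote concrete, here is a minimal Python sketch. It is a toy model, not SQL Server's actual storage format: the class name `ToyClusteredIndex`, the per-key counter, and numbering from 0 are all assumptions made for illustration. The only point it shows is that pairing a non-unique key with a hidden extra value gives every row a unique locator that a non-clustered index could point to.

```python
from collections import defaultdict


class ToyClusteredIndex:
    """Toy model of a non-unique clustered index (hypothetical, for illustration)."""

    def __init__(self):
        # Assumed scheme: one counter per key value, starting at 0.
        self._next_uniqueifier = defaultdict(int)
        # (key, uniqueifier) -> row payload
        self.rows = {}

    def insert(self, key, payload):
        # Pair the possibly duplicated key with a hidden value so the
        # combined locator is unique even when the key repeats.
        uniqueifier = self._next_uniqueifier[key]
        self._next_uniqueifier[key] += 1
        locator = (key, uniqueifier)
        self.rows[locator] = payload
        return locator

    def lookup(self, locator):
        # A non-clustered index would store this locator to find the row.
        return self.rows[locator]


index = ToyClusteredIndex()
loc_a = index.insert(10, "Alice")
loc_b = index.insert(10, "Bob")    # same key as Alice
loc_c = index.insert(20, "Carol")
```

In this model the two rows with key 10 get different locators, so a lookup by locator is never ambiguous. The extra value is also the source of the storage and insert overhead the quote mentions.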
They don't, and there are times where it's better if they're not.
Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is
then it's best for performance if DepartmentId is the clustered index, even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered together).
There's the Clustered Index Design Guidelines, for example, which says,
My understanding of "high degree of uniqueness", for example, is that it isn't good to choose "Country" as the clustered index if most of your queries want to select the records within a given town.
On "Do clustered indexes have to be unique?"
No!
Think of this situation,
You have 100 records
You want records ID = 50-59 (ID values)
The clustered index seeks to the first record with ID = 50
It collects records until it hits ID = 60 and stops (because the records are stored in key order, it knows there will not be any more matching records after 59)
So the clustering can be thought of as a special case of ORDER BY
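The range lookup described above can be sketched in Python. This is only an illustration of why sorted storage makes a range read cheap, not how SQL Server implements it: a plain sorted list stands in for the clustered index, and the function name `range_scan` is made up for the example.

```python
from bisect import bisect_left


def range_scan(sorted_ids, low, high):
    """Return all ids with low <= id <= high from a list sorted ascending."""
    # Jump straight to the first id >= low instead of reading from the start.
    start = bisect_left(sorted_ids, low)
    result = []
    for value in sorted_ids[start:]:
        if value > high:
            # Sorted order guarantees nothing later can match, so stop here.
            break
        result.append(value)
    return result


ids = list(range(1, 101))            # 100 records with IDs 1..100
selected = range_scan(ids, 50, 59)   # the ten records with IDs 50..59
```

The same function works when the key is not unique: duplicates simply sit next to each other in the sorted order and are all returned.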
Now, if your table has an ID column to make the records unique and a UID for who inserted the record, you could cluster it on the UID so the front end can request records by UID, while the PK is the ID.
This case depends on how you are going to use the data.
If you are tuning an old DB this is a godsend. I am working on performance issues in a 20-year-old DB. It has nonclustered PKs with 3 to 8 columns. Instead of clustering on all 8 columns to get uniqueness, I can pick one column with a broad distribution, and SQL Server applies a uniqueifier. The uniqueifier is an int, so with a column like Project ID it can tell apart up to 2147483647 rows that share the same ProjectID, which is enough for most use cases. If it is not enough, add a second or third column to the clustering key.
This works without any code changes in the app layer. After 20 years in production, management doesn't have to order a major rewrite.