无法直接访问聚集表中的数据行 - 为什么?
[11] 告诉:
“在非聚集索引中,叶级不包含所有数据。除了键值之外,叶级(树的最低级别)中的每个索引行还包含一个告诉 SQL Server 在哪里可以找到与索引中的键对应的数据行的书签。
书签可以采用两种形式之一。 如果表有聚集索引,则书签是对应数据行的聚集索引键
。如果表是堆(换句话说,它没有聚集索引),则书签是行标识符 (RID),它是 File#:Page#:Slot# 形式的实际行定位符。"
” 索引键或非聚集索引的副本是否有一个指向它的指针?
是否应该遍历所有聚集索引结构,即带有中间数据的b树,以通过聚集表上的非聚集索引书签来获取行数据?
聚集 直接引用变得不可能?
更新:
让我重新表述一下这个问题。这是如何完成的我可以自己阅读,但我想了解为什么要这样做。
继续通过 RID 从具有(添加的)聚集索引的非聚集索引引用行数据不是会更高效吗?
假设一个表只有非聚集索引(但没有聚集索引)。
非聚集索引叶包含真实数据的 RID。用于直接访问行数据,无需任何查找/遍历。
添加聚集索引意味着消除 IAM(索引分配映射)页面,并用聚集索引键替换所有非聚集索引的所有 RID + 需要额外查找而不是直接访问。
这有什么意义呢?
更新2:
我的问题是否被微软本人否决了?再次感谢,这是一种荣幸。
在不解释的情况下投反对票是没有意义的。
更新3:
@PerformanceDB”,我无法理解您回答中的这句话:
“这也意味着 B 树的索引高度减少了一级(这就是为什么如果你检查它们的话它们很小的原因)。”
你能解释一下吗?
是的,我想要插图。
我开始阅读:< a href="http://blogs.sqlserver.org.au/blogs/greg_linwood/archive/2006/09/25/387.aspx" rel="nofollow noreferrer">揭穿有关聚集索引的神话 - 第 4 部分(CIX、 TPC-C 和 Oracle 集群) 和许多其他来源一样,它明确指出 SQL Server 与 Oracle 不同,缺乏对集群表的直接访问功能
Update4 (Update5 -通过删除线更正):
一些回答者提到了这样一个事实:NCI 叶子中的书签 CI 密钥是为了在页面拆分的情况下实现地址独立。
在使用 CI NCI(非聚集索引)对非聚集表进行索引重组或碎片整理期间,是否会重新定位行以及 NCI 中相应的 RID NCI 中的更改修改的?
在我看来,这似乎是解决方案缺陷 - 该行应该随其地址一起移动,不是吗?
另外,堆是否完全不受页面拆分的影响?由于行中可变大小数据类型的大小增加
相关问题:
引用:
[11]
Microsoft® SQL Server™ 2005 内部:存储引擎
作者:Kalen Delaney -(扎实的优质学习)
................................................
出版商:微软出版社
发布日期:2006 年 10 月 11 日
打印 ISBN-10:0-7356-2105-5
打印 ISBN-13:978-0-7356-2105-3
页数:464
[11a] p.250 第 7 章中的索引组织。索引内部结构和管理
这里是有用的在线复制粘贴
http://sqlserverindexeorgnization.blogspot.com/
尽管没有任何来源来源
[11] tells:
"In a nonclustered index, the leaf level does not contain all the data. In addition to the key values, each index row in the leaf level (the lowest level of the tree) contains a bookmark that tells SQL Server where to find the data row corresponding to the key in the index.
A bookmark can take one of two forms. If the table has a clustered index, the bookmark is the clustered index key for the corresponding data row
. If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#."
Is this a copy of clustered index key or nonclustered index has a pointer to it?
Should all clustered index structure, i.e. b-tree with intermediate data, be traversed to get to row data through non-clustered index bookmark on clustered table?
What does clustered index indtroduce that direct referencing become impossible?
Update:
Let me re-phrase the question. How this is done I can read myself but I want to understand why it is done this way.
Would not it be much more efficient to continue referencing row data by RID from non-clustered index having (added) clustered one?
Suppose a table has only non-clustered index(es) (but no clustered index).
Non-clustered index leaves contains RID to real data. For direct access of row data without any need of lookup/traversals.
Adding clustered index means elimination of IAM (Index Allocation Map) pages and substituting all RIDs of all non-clustered indexes by clustered index keys + necessity of additional lookup instead of direct access.
What is the point in this?
Update2:
Was my question downvoted by Microsoft himself? Thanks again, this is an honor.
It is pointless to downvote without explaining.
Update3:
@PerformanceDB", I could not understand the phrase in your answer:
""It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them)."
Can you explain it?
Yes, I'd like illustrations.
I started to read: Debunking myths about clustered indexes - part 4 (CIXs, TPC-C & Oracle clusters) and it, as many other sources, explicitly refer to the fact that SQL Server, in contrat to Oracle, lacks direct access features on a clustered table.
Update4 (Update5 - corrected by strike-out):
A few answerers referred to the fact that a bookmark CI key in NCI leaf is for address independence in case of page splits.
Aren't during index reorganization or de-fragmenting in non-clustered table with CI NCI (non-clustered index) the rows relocated and corresponding RIDs in NCI change in NCI are modified?
This seems to me as addressing scheme deficiency - the row should have moved with its address, should have not it?
Also, Is heap completely immune to page splits? due to size increase of variable size data types in row
Related questions:
- What do I miss in understanding the clustered index?
- Why/when/how is whole clustered index scan chosen rather than full table scan?
- Reasons not to have a clustered index in SQL Server 2005
Cited:
[11]
Inside Microsoft® SQL Server™ 2005: The Storage Engine
By Kalen Delaney - (Solid Quality Learning)
...............................................
Publisher: Microsoft Press
Pub Date: October 11, 2006
Print ISBN-10: 0-7356-2105-5
Print ISBN-13: 978-0-7356-2105-3
Pages: 464
[11a]
p.250 Section Index organization from Chapter 7. Index Internals and Management
Here is helpful online copypaste from it
http://sqlserverindexeorgnization.blogspot.com/
though without any credits to source
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
问题在于该文档过于冗长,并且增加了它声称它已澄清的混乱。如果你忘记所有这些并重新开始,它实际上是非常简单的。由于您正在询问数据存储结构,并关心性能,让我们看看这个角度(而不是逻辑)。不存在称为“表”的数据存储结构。
堆
包含行的数据页。没有聚集索引。行不会因插入/删除而移动。可以整体读取行(表扫描)或单独读取行(通过非聚集索引)。它变得严重碎片化。
聚集索引
B 树。索引与数据行聚集。叶级别是数据行。这意味着每次访问都会减少一次 I/O。这也意味着 B 树的索引高度减少了一级(这就是为什么如果你检查它们的话它们很小的原因)。堆(整个数据存储结构)被消除。没有指针。行按聚集索引键顺序维护(行由于插入/删除/扩展而在页面上移动)。页面在范围内被修剪。
非聚集索引
B 树。根据行数要求的全高。
如果有聚集索引,叶子级别就是聚集索引键(这样它就可以到达 CI 中的确切位置,即行)。
在没有聚集索引的情况下,叶子级别是一个指针:File:Page:Offset(以便它可以转到堆,并获取行)。堆中的 RowId 不会更改(如果更改了,每次插入/删除一行时,您都必须更新所有关联 NCI 中的所有 NCI 条目,对于所有其他 /em> 页面上的行)。
这就是为什么当您创建 CI 时,所有 NCI 都会自动重建(它们必须从 [2] 切换到 1)。显然,始终在 NCI 之前创建 CI。
没有File:Page:Slot,行长度是可变的,它是Page内的Offset。
没有书签或其他古书。
“无法直接访问聚集表中的数据行 - 为什么”
无稽之谈。您可以通过 CI(少一个 I/O)或 NCI⇢CI 键直接立即访问每个数据行。
这是非常快的,是 Britton Lee 发明的;由 Sybase 重新实施并获得专利;达斯·维达以不诚实的方式以微薄的代价获得了它。
如果您需要进一步说明,我可以提供插图。
对评论的回应
“这也意味着 B 树的索引高度减少了一级(这就是为什么如果您检查它们,它们会很小)。”
假设您有一个包含 10 亿行的表。垂直绘制的任何给定索引(例如,唯一的、PK 上的)的 B 树的“高度”为 8;或者您可以说索引的深度为 8 层,位于顶部(单个条目)和底部(叶级)之间。叶层当然是最宽、人口最多的;它将有 10 亿个条目。假定每个索引页包含 256 个条目,则叶子减一级别包含 390K 个条目。
CI B 树(仅索引部分)将包含 7 个级别、390K 条目,占用 10MB;因为叶级别是数据行(其中有 10 亿个条目,很好地分布在 100GB 上),因此被排除或不重复。
是的,我想要插图。
好的。我有一套已完成的 Sybase 文档;为了避免混淆,我为您精心挑选了一份,并排除了 Sybase 有而 MS 没有的部分。对不起。不要点击链接,只停留在一页上。此外,堆中的碎片水平非常低,这与 Sybase 和 MS 中的堆碎片很大不同,所以我保持原样。
数据存储基础知识
“我开始阅读:揭穿有关集群索引的神话 - 第 4 部分(CIX、TPC-C 和 Oracle 集群),它与许多其他来源一样,明确指出 SQL Server 与 Oracle 不同,缺乏对集群表的直接访问功能。”
请小心阅读,网络上充满了肤浅的信息;断章取义地讨论半真半假的内容;错误信息(来自供应商以及善意的无知者)。正如你所注意到的,我只是回答问题;我不会浪费时间回答参考文献中提出的观点。
想想这个。实施良好的带有 CI 的表不需要碎片整理;当实施得不好时,需要不频繁的碎片整理;没有 CI 的表需要频繁且几乎离线的碎片整理。这是持续到周一早上的维护窗口。这只是一个例子,说明为什么孤立地讨论项目实际上是错误信息。这就是为什么我的文档都是相互链接和相关的。
“一些回答者提到 NCI 叶中的 CI 密钥是为了在页面拆分时实现地址独立。”
是的,我只是不会这么说,这和第一个参考文献一样令人困惑你发布了。页面拆分与此无关。为了清楚起见,我故意采用了我在上面的帖子中所做的方式。由于行移动(CI 保持页面和范围修剪),NCI 必须具有 CI 密钥,以便找到该行。它不能使用一直在变化的 RowId。除非你有宽 CI 密钥,否则这没什么大不了的; 4 字节 RowId(加上处理开销)与 8 字节 CI 密钥(减去所述开销)......谁在乎(好吧,也许是你)。解决较高级别的问题,而低级别的问题将小到不需要解决。当您的数据库碎片化且未规范化时,在低级别上挤出 1% 的性能改进是非常愚蠢的。
系统由一组集成的组件组成,任何组件都不能单独更改或评估。一堆没有集成的组件是解体的,而不是一个系统。以你的提问水平,你还不能下结论,或者对这个或那个有怨恨,如果你这样做了,那是过早的结论和怨恨,会阻碍你的进步。最重要的是,通过问答获得的知识与通过阅读加经验获得的知识之间存在很大差异。
“在使用 CI 对非聚集表进行索引重新组织或碎片整理期间,难道 NCI 中的行已重新定位并且 NCI 中相应的 RID 不会发生变化吗?”
您的意思是“使用 CI 的非聚集索引” ? NCI 不值得进行碎片整理,只需删除/创建它们即可。
或者您的意思是“对 CI [整个表]进行碎片整理”?我已经发布过,当您重新创建 CI(或对其进行碎片整理)时,NCI 会自动重建。这与 RowId 无关,而是与更改有关:当您删除 CI 时,必须将 NCI 从 CI 键重写为 RowIds;创建 CI 时,NCI 必须更改回 CI 密钥。开启的 DBA 在删除 CI 之前先删除 NCI。
“在我看来,这解决了方案缺陷 - 该行应该随其地址一起移动,不是吗?”
你的层次太低了,不了解更高的层次。如果该行移动,其地址就会改变;如果地址改变,该行就会移动。要么你有一个 CI(行移动),要么你有一个堆(行不移动)。
“此外,堆是否完全不受页面拆分的影响?”
不会。当可变长度行扩展并且页面上没有空间时,页面拆分仍然会发生。但在计划中,由于从不移动行,堆上存在大量碎片,由于它是基于 RowId 的(NCI 所依赖的),所以这是一个小项目。
The problem is that the doco is gobbledegook, and increases the very confusion it is alleging it clarifies. If you forget about all that and start again, it is actually quite straight-forward. Since you are inquiring re data storage structures, and concerned re performance, let's look at that perspective (not the logical). There is no data storage structure caled "Table".
Heap
Data pages containing rows. There is no Clustered Index. The rows are not shifted as a result of Inserts/Deletes. The rows can be read in entirety (table scan) or singly (via a NonClustered Index). It gets badly fragmented.
Clustered Index
B-Tree. The Index is Clustered with the Data Rows. The leaf level is the data row. That means one less I/O on every access. It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them). The Heap (entire data storage stucture) is eliminated. There are no pointers. The rows are maintained in Clustered Index Key order (rows are moved on the page as a result of Insert/Delete/Expand). Pages are trimmed within the extents.
NonClustered Index
B-Tree. Full height as required by number of rows.
Where there is a Clustered Index, the Leaf level is the Clustered Index Key (so that it can go to the exact location in the CI, which is the row).
Where there is no Clustered Index, the Leaf level is a pointer: File:Page:Offset (so that it can go to the Heap, and get the row). The RowIds in the Heap do not change (if they did, every time you Inserted/Deleted one row, you would have to update all the NCI entries in all associated NCIs, for all other rows on the page).
That is why, when you create a CI, all NCIs are automatically rebuilt (they have to be switched from [2] to 1). Obviously, always create the CI before the NCIs.
There is no File:Page:Slot, the row length is variable, it is Offset within Page.
There is no Bookmark or other goobledegook.
Re "No direct access to data row in clustered table - why"
Nonsense. You have direct and immediate access to each data row, via the CI (one less I/O) or the NCI⇢CI Key.
This is very fast, invented by Britton Lee; re-implemented and patented by Sybase; obtained by dishonesty and for a pittance by Darth Vader.
If you need further clarification, I can provide illustrations.
Responses to Comments
"It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them)."
Let's say you have a tables with 1 billion rows. The "height" of the B-Tree of any given index (eg. Unique, on PK) drawn vertically is say 8; or you can say the index is 8 levels deep, between the top (a single entry) and the bottom, the leaf level. the leaf level is of course the widest, and most polpulated; it will have 1 billion entries. Given that each index page contains say 256 entries, the leaf-minus-one level contains 390K entries.
The CI B-tree (index only portion) will contain 7 levels, 390K entries, taking 10MB; because the leaf level IS the data row (of which there are 1 billion entries, spread nicely across 100GB), and is thus excluded, or not repeated.
Yes, I'd like illustrations.
Ok. I have a set of finished Sybase docs; I have butchered one for you, so as to avoid confusion, and excluded the bits that Sybase has, that MS does not. Sorry. Don't follow the links, just stay on the one page. Also the very low levels of Fragmentation in the heap are different by the fragmentation in the Heap is massive, in both Sybase and MS, so I have left that intact.
Data Storage Basics
"I started to read: Debunking myths about clustered indexes - part 4 (CIXs, TPC-C & Oracle clusters) and it, as many other sources, explicitly refer to the fact that SQL Server, in contrat to Oracle, lacks direct access features on a clustered table."
Be careful what you read, the web is full of superficial information; half truths discussed out of context; misinformation (from vendiors as well as well-meaning ignorants). As you notice, I just answer questions; I do not waste time answering points raised in references.
Just think about this. Well-implemented Tables with a CI do not need de-fragmentation; and when implemented badly, need infrequent de-fragmentation; tables without a CI need frequent and pretty much offline de-fragmentation. That's your maintenance window running into Monday morning. Just an example of why discussing items in isolation is actually misinformation. Which is why my docs are all linked and related to one another.
"A few answerers referred to the fact that CI key in NCI leaf is for address independence in case of page splits."
Yes, I just would not put it that way, that's as confusing as the first reference you posted. Page splits have nothing to do with it. I put is the way I did in my post above on purpose, for clarity. Since the rows move (the CI keeps the pages and Extents trim), the NCI MUST have the CI key, in order to find the row. It can't use a RowId which would keep changing all the time. Unless you have wide CI keys, this is no big deal; a 4-byte RowId (plus processing overhead) vs an 8-byte CI key (minus said overhead) ... who cares (ok, maybe you). Address the higher level issues, and the low level issues will be small enough to not warrant address. Squeezing out 1% performance improvement at the low level when your db is fragmented and unnormalised, is amore than a bit silly.
A system in an integrated set of components, none can be changed or evaluated in isolation. A bunch of components that are not integrated are dis-integrated, not a system. At your level of questioning, you are not yet in a position to form conclusions, or have grudges against this or that, if you do, they are premature conclusions and grudges, that will impede your progress. On top of that, there is a big difference between knowledge gained by question-and-answer vs knowledge gained by reading plus experience.
"Aren't during undex reotganization or defragmanting of non-clustered table with CI the rows relocated and corresponding RIDs in NCI change in NCI?"
Do you mean "non-clustered INDEX with CI" ? Well the NCIs are not worth de-fragmenting, just drop/create them.
Or do you mean "de-fragmenting a CI [whole table]" ? I have already posted, when you re-create the CI (or de-fragment it in place), the NCIs are automatically rebuilt. It is not about RowIds, it is about the change: when you drop the CI, the NCIs have to be rewritten from CI keys to RowIds; when you create the CI, the NCIs have to the changed back to CI Keys. Switched on DBAs drop the NCI before dropping the CI.
"This seems to me as addressing scheme deficiency - the row should have moved with its address, should have not it?"
You're getting too low-level without understanding the higher levels. If the row moves, its address changes; if the address changes, the row moves. Either you have a CI (rows move) xor you have a Heap (rows do not move).
"Also, Is heap completely immune to page splits?"
No. Page Splits still happen when variable length rows expand and there is no room on the page. But in the scheme of things, massive fragmentation on Heaps, due to never moving rows, due to it being RowId based (which the NCIs rely on), this is a small item.
不!如果表有插入,并且发生页面分割,那么您可能必须更新大量使用 RID 的引用来指向已被插入的数据行的新位置。移至 SQL Server 中的新页面。这正是 SQL Server 团队选择使用聚集键的原因,可以这么说,将其作为“数据指针”。当发生页面拆分时,集群键的值不会改变,因此不需要更新索引。
NO ! If a table has an insert, and a page split occurs, then you would have to potentially update a lot of references that use a RID to point to the new locations of those data rows that have been moved to a new page in SQL Server. That's exactly why the SQL Server team chose to use the clustering key instead, as the "data pointer", so to speak. The value of the clustering key does not change when a page split occurs, so no update to indices are needed.
聚集索引的全部要点是记录是通过逻辑定位器(通常不会改变)而不是物理来访问的。
如果索引指向物理 RID 并且行更改了其物理位置(例如从页拆分),则所有索引也需要更新。
这正是聚集索引的发明来解决的问题。
The whole point of a clustered index is that the records are accessed via a logical locator (which is not normally meant to change), not physical.
If the indexes were pointing to a physical
RID
and a row changed its physical location (say from a page split), all indexes would need to be updated too.It's exactly the kind of problem the clustered indexes were invented to deal with.
如果表具有聚集索引,则每个非聚集索引行都包含聚集索引键的副本。
如果表没有聚集索引,即表是堆,则每个非聚集索引行都包含一个根据文件标识符(ID)、页号和页上的行号构建的指针。整个指针称为行 ID (RID)。
当您使用聚集索引标识(选择)一行时,您将获得该行的所有列。
当您识别非聚集索引中的行时,您需要执行另一个查找步骤来获取非聚集索引中未包含的列。
If a table has a clustered index, each non-clustered index row contains a copy of the clustered index key.
If a table does not have a clustered index, i.e. the table is a heap, each non-clustered index row contains a pointer built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
When you identify (select) a row using a clustered index, you have all the columns from the row.
When you identify a row in a non-clustered index, you need to perfrom another lookup step to obtain the columns not included in the non-clustered index.
是的,在许多情况下这会更有效。我相信聚集索引最初就是这样实现的(在 6.0 版本中?)。据推测,它们的更改是由于 marc_s 提到的原因,如果您的聚集索引具有大量页面拆分,则这是有意义的。
In many cases it would be more efficient, yes. I believe that clustered indexes were originally implemented that way (in version 6.0?). Presumably they were changed for the reasons that marc_s mentioned, which make sense if your clustered index is such that it has a lot of page splits.
我不会发布这个(我的)问题,在我在这里发布之前我见过AlexSmith 的回答,我在发布后几分钟就看到了,并且已经在这里得到了回答:
遗憾的是,不可能在这里接受它作为答案
更新:
PerformanceDBA 在这里接受了答案 告诉:“问题在于 doco 是冗长的,并且增加了它声称它澄清的混乱”
好吧,所有 msdn 文档都告诉并显示,例如,参见。图片来自聚集索引结构与"堆结构" 表明聚集表没有 IAM 页面。同时,按照 Inside the Storage Engine:Using DBCC PAGE 的代码输出DBCC IND 查明页面拆分是否回滚 显示相反的结果。
由于不想继续在这里发送垃圾邮件,我将我的提问和参与转移到 /www.sqlservercentral.com/Forums
我的相关问题:
I would not have posted this (my) question, have I seen before my posting here that answer by AlexSmith there, which I saw just a few minutes after posting and having been already answered here:
It is pity, it is impossible to accept it there as an answer here
Update:
The accepted here answer by PerformanceDBA told: "The problem is that the doco is gobbledegook, and increases the very confusion it is alleging it clarifies"
Well, all msdn docs tell and show, for ex., cf. pictures from Clustered Index Structures vs. "Heap Structures" that clustered table does not have IAM page. Meanwhile, the output from following the code from Inside the Storage Engine: Using DBCC PAGE and DBCC IND to find out if page splits ever roll back shows the opposite.
Having no desire to continue spamming here I shifted my questioning and participation to /www.sqlservercentral.com/Forums
My related question there: