聚集索引对数据库性能的影响

发布于 2024-09-10 08:40:39 字数 423 浏览 4 评论 0原文

我最近参与了一个新的软件项目,该项目使用 SQL Server 2000 进行数据存储。

在审查该项目时,我发现其中一个主表在其主键上使用了聚集索引,该索引由四列组成:

Sequence  numeric(18, 0)
Date      datetime
Client    varchar(9)
Hash      tinyint

该表在正常操作过程中经历了大量插入。

现在,我是一名 C++ 开发人员,而不是数据库管理员,但我对此表设计的第一印象是,将这些字段作为聚集索引将对插入性能非常不利,因为数据必须在物理上重新排序每个插入。

此外,我真的看不出这样做有什么好处,因为必须经常查询所有这些字段才能证明聚集索引的合理性,对吗?

所以基本上,当我去说服他们应该改变桌子设计时,我需要一些弹药。

I recently became involved with a new software project which uses SQL Server 2000 for its data storage.

In reviewing the project, I discovered that one of the main tables uses a clustered index on its primary key which consists of four columns:

Sequence  numeric(18, 0)
Date      datetime
Client    varchar(9)
Hash      tinyint

This table experiences a lot of inserts in the course of normal operation.

Now, I'm a C++ developer, not a DB Admin, but my first impression of this table design was that that having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.

In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?

So basically I need some ammunition for when I go to the powers that be to convince them that the table design should be changed.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

眼中杀气 2024-09-17 08:40:39

聚集索引应包含查询最多的列,以提供最大的搜索机会或使非聚集索引覆盖查询中的所有列。

主键和聚集索引不必相同。它们都是候选键,并且表通常有多个这样的键。

你说

此外,我真的看不出这样做有任何好处,因为必须经常查询所有这些字段才能证明聚集索引的合理性,对吗?

那不是真的。只需使用聚集索引的第一列或第二列即可进行查找。这可能是范围搜索,但它仍然是搜索。您不必指定它的所有列即可获得该好处。但列的顺序确实很重要。如果您主要在客户端上进行查询,那么将序列列作为聚集索引中的第一个列是一个糟糕的选择。第二列的选择应该是与第一列一起查询最多的项目(而不是单独查询)。如果您发现第二列本身的查询频率几乎与第一列一样频繁,那么非聚集索引将会有所帮助。

正如其他人所说,尽可能减少聚集索引中的列/字节数很重要。

不幸的是,序列是一个随机值而不是递增的,但这可能无济于事。答案不是添加一个标识列,除非您的应用程序可以开始使用它作为该表的主要查询条件(不太可能)。现在,由于您陷入了这个随机序列列(假设它是最常查询的),让我们看看您的另一个语句:

将这些字段作为聚集索引会对插入性能非常不利,因为每次插入时都必须对数据进行物理重新排序。

这并不完全正确。

磁盘上的物理位置并不是我们在这里讨论的真正内容,但它确实会在碎片方面发挥作用,这对性能有影响。

每个 8k 页面内的行都是排序的。只是每一页的所有行都比下一页少,比上一页多。当您插入一行并且页面已满时,就会出现问题:出现页面拆分。引擎必须将插入行之后的所有行复制到新页面,这可能会很昂贵。使用随机密钥,您将获得大量页面拆分。您可以通过在重建索引时使用较低的填充因子来改善该问题。您必须反复尝试才能获得正确的数字,但 70% 或 60% 可能比 90% 更适合您。

我相信将日期时间作为第二个 CI 列可能是有益的,因为您仍然需要处理需要在两个不同的序列值之间拆分的页面,但它并不像 CI 中的第二列也是随机的那么糟糕,因为您可以保证在每次插入时进行页面拆分,其中使用升序值,如果该行可以添加到页面中,您会很幸运,因为下一个序列号从下一页开始。

缩短表中所有列的数据类型和数量及其非聚集索引也可以提高性能,因为每页更多行=每个请求更少的页面读取。特别是当引擎被迫进行表扫描时。将一堆很少查询的列移动到单独的 1-1 表中可以为您的某些查询带来奇迹。

最后,有一些设计调整也可能有所帮助(在我看来):

  • 将 Sequence 列更改为 bigint,以便为每行保存一个字节(8 个字节,而不是数字的 9 个字节)。
  • 使用具有 4 字节 int 标识列而不是 varchar(9) 的 Client 查找表。这每行节省 5 个字节。如果可能,请使用 Smallint(-32768 到 32767),即 2 个字节,每行可以节省 7 个字节。

摘要: CI 应该从查询最多的列开始。尽可能从 CI 中删除所有列。尽可能缩短列(字节)。使用较低的填充因子来减轻由随机序列列引起的页面拆分(如果由于查询最多而必须保留在第一位)。

哦,让你的在线碎片整理开始吧。如果桌子无法改变,至少可以经常重新整理,以保持最佳状态。也不要忽视统计数据,这样引擎就可以选择合适的执行计划。

更新

另一个要考虑的策略是表中使用的复合键是否可以转换为 int,并创建值的查找表。假设少于全部 4 列的某种组合在 100 多行中重复,例如序列 + 客户端 + 哈希,但仅具有不同的日期值。然后,插入具有标识列的单独的 SequenceClientHash 表可能是有意义的,因为这样您就可以查找人造键一次并一遍又一遍地使用它。这也将使您的 CI 仅在最后一页添加新行 (yay),并显着减小 CI 的大小,就像在所有非聚集索引中重复的那样 (yippee)。但这仅在某些狭窄的使用模式下才有意义。

现在,marc_s 建议添加一个额外的 int 标识列作为聚集索引。这可能有助于使所有非聚集索引每页获得更多行,但这完全取决于您想要的性能,因为这将保证表上的每个查询都必须使用书签查找,您永远无法进行表查找。

关于“大量的页面分割和糟糕的索引碎片”:正如我已经说过的,这可以通过较低的填充因子得到一定程度的改善。此外,频繁的在线索引重组(与重建不同)可以帮助减少这种影响。

最终,这一切都取决于确切的系统及其独特的数据访问模式以及有关您想要优化哪些部分的决策。对于某些系统,只要选择始终很快,插入速度较慢也不错。对于其他人来说,具有一致但稍慢的选择时间比具有稍快但不一致的选择时间更重要。对于其他人来说,数据在被推送到数据仓库之前不会被真正读取,因此插入需要尽可能快。此外,性能不仅与用户等待时间甚至查询响应时间有关,还与服务器资源有关,尤其是在大规模并行性的情况下,因此总吞吐量(例如,每时间单位的客户端响应)比任何其他因素都重要。

The clustered index should contain the column(s) most queried by to give the greatest chance of seeks or of making a nonclustered index cover all the columns in the query.

The primary key and the clustered index do not have to be the same. They are both candidate keys, and tables often have more than one such key.

You said

In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?

That's not true. A seek can be had just by using the first column or two of the clustered index. It may be a range seek, but it's still a seek. You don't have to specify all the columns of it in order to get that benefit. But the order of the columns does matter a lot. If you're predominantly querying on Client, then the Sequence column is a bad choice as the first in the clustered index. The choice of the second column should be the item that is most queried in conjunction with the first (not by itself). If you find that a second column is queried by itself almost as often as the first column, then a nonclustered index will help.

As others have said, reducing the number of columns/bytes in the clustered index as much as possible is important.

It's too bad that the Sequence is a random value instead of incrementing, but that may not be able to be helped. The answer isn't to throw in an identity column unless your application can start using it as the primary query condition on this table (unlikely). Now, since you're stuck with this random Sequence column (presuming it IS the most often queried), let's look at another of your statements:

having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.

That's not entirely true.

The physical location on the disk is not really what we're talking about here, but it does come into play in terms of fragmentation, which is a performance implication.

The rows inside each 8k page are not ordered. It's just that all the rows in each page are less than the next page and more than the previous one. The problem occurs when you insert a row and the page is full: you get a page split. The engine has to copy all the rows after the inserted row to a new page, and this can be expensive. With a random key you're going to get a lot of page splits. You can ameliorate the problem by using a lower fillfactor when rebuilding the index. You'd have to play with it to get the right number, but 70% or 60% might serve you better than 90%.

I believe that having datetime as the second CI column could be beneficial, since you'd still be dealing with pages needing to be split between two different Sequence values, but it's not nearly as bad as if the second column in the CI was also random, since you'd be guaranteed to page split on every insert, where with an ascending value you can get lucky if the row can be added to a page because the next Sequence number starts on the next page.

Shortening the data types and number of all columns in a table as well as its nonclustered indexes can boost performance too, since more rows per page = fewer page reads per request. Especially if the engine is forced to do a table scan. Moving a bunch of rarely-queried columns to a separate 1-1 table could do wonders for some of your queries.

Last, there are some design tweaks that could help as well (in my opinion):

  • Change the Sequence column to a bigint to save a byte for every row (8 bytes instead of 9 for the numeric).
  • Use a lookup table for Client with a 4-byte int identity column instead of a varchar(9). This saves 5 bytes per row. If possible, use a smallint (-32768 to 32767) which is 2 bytes, an even greater savings of 7 bytes per row.

Summary: The CI should start with the column most queried on. Remove any columns from the CI that you can. Shorten columns (bytes) as much as you can. Use a lower fillfactor to mitigate the page splits caused by the random Sequence column (if it has to stay first because of being queried the most).

Oh, and get your online defragging going. If the table can't be changed, at least it can be reorganized frequently to keep it in best possible shape. Don't neglect statistics, either, so the engine can pick appropriate execution plans.

UPDATE

Another strategy to consider is if the composite key used in the table can be converted to an int, and a lookup table of the values is created. Let's say some combination of less than all 4 columns is repeated in over 100 rows, for example, Sequence + Client + Hash but only with varying Date values. Then an insert to a separate SequenceClientHash table with an identity column could make sense, because then you could look up the artificial key once and use it over and over again. This would also get your CI to add new rows only on the last page (yay) and substantially reduce the size of the CI as repeated in all nonclustered indexes (yippee). But this would only make sense in certain narrow usage patterns.

Now, marc_s suggested just adding an additional int identity column as the clustered index. It is possible that this could help by making all the nonclustered indexes get more rows per page, but it all depends on exactly where you want the performance to be, because this would guarantee that every single query on the table would have to use a bookmark lookup and you could never get a table seek.

About "tons of page splits and bad index fragmentation": as I already said this can be ameliorated somewhat with a lower fill factor. Also, frequent online index reorganization (not the same as rebuilding) can help reduce the effect of this.

Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized. For some systems, having a slower insert isn't bad as long as selects are always fast. For others, having consistent but slightly slower select times is more important than having slightly faster but inconsistent select times. For others, the data isn't really read until it's pushed to a data warehouse anyway so the inserts need to be as fast as possible. And adding into the mix is the fact that performance isn't just about user wait time or even query response time but also about server resources especially in the case of massive parallelism, so that total throughput (say, in client responses per time unit) matters more than any other factor.

滥情稳全场 2024-09-17 08:40:39

聚集索引 (CI) 在不断增加、狭窄且很少变化的值上效果最佳。您将希望 CI 覆盖在使用 >=、<= 或 BETWEEN 语句的查询中最常被命中的列。

我不确定您的数据通常如何受到影响。大多数情况下,您会在 IDENTITY 列或另一个窄列上看到 CI(因为该列也将“附加”返回到所有非聚集索引,并且我们不希望在每次获取时添加大量数据如果不需要)。数据可能在日期上被最频繁地查询,这可能是一个不错的选择,但所有四列可能都不正确(我强调可能,因为我不知道设置 -起来;这可能没有什么问题吧)。这里有一些提示: http://msdn.microsoft .com/en-us/library/aa933131%28SQL.80%29.aspx

Clustered indexes (CI) work best over ever-increasing, narrow, rarely changing values. You'll want your CI to cover the column(s) that get hit the most often in queries with >=, <=, or BETWEEN statements.

I'm not sure how your data normally gets hit. Most often you'll see a CI on an IDENTITY column or another narrow column (because this column will also be returned "tacked on" to all non-clustered indexes, and we don't want a ton of data added on to every fetch if it isn't needed). It's possible the data might be getting queried most often on date, and that may be a good choice, but all four columns is likely not correct (I stress likely, because I don't know the set-up; this may not have anything wrong with it). There are some pointers here: http://msdn.microsoft.com/en-us/library/aa933131%28SQL.80%29.aspx

情深缘浅 2024-09-17 08:40:39

关于 SQL 如何创建和使用索引,您存在一些误解。

聚集索引 不一定通过聚集索引在磁盘上物理排序,至少不是实时排序。它们只是一个逻辑顺序。

我不认为基于此结构会造成重大性能影响,并且在您实际确定与该索引相关的性能问题之前删除聚集索引显然是不成熟的优化

此外,索引也很有用(尤其是其中包含多个字段的索引),即使对于不排序或不查询其中包含的所有列的搜索也是如此。

显然,创建多部分聚集索引应该有一个合理的理由,就像任何索引一样,因此如果您认为它是随意添加的,那么提出这个要求是有意义的。

底线:在实际检测到插入性能问题之前,不要优化插入性能的索引。通常这是不值得的。

There are a few things you are misunderstanding about how SQL creates and uses indexes.

Clustered indexes aren't necessarily physically ordered on disk by the clustered index, at least not in real-time. They are just a logical ordering.

I wouldn't expect a major performance hit based on this structure and removing the clustered index before you have actually identified a performance issue related to that index is clearly premature optimization.

Also, an index can be useful (especially one with several fields in it) even for searches that don't sort or get queried on all columns included in it.

Obviously, there should be a justification for creating a multi-part clustered index, just like any index, so it makes sense to ask for that if you think it was added capriciously.

Bottom line: Don't optimize the indexes for insert performance until you have actually detected a performance problem with inserts. It usually isn't worth it.

酒废 2024-09-17 08:40:39

如果表上只有单个聚集索引,那可能还不错。但是,聚集索引还用于查找非聚集索引中任何命中的真实数据页 - 因此,聚集索引(其所有列)也是您可能拥有的每个非聚集索引的一部分你的桌子。

因此,如果表上有一些非聚集索引,那么您肯定会 a) 浪费大量空间(不仅在磁盘上,还在服务器的 RAM 中!),b) 您的性能会很差。

一个好的聚集索引应该是:

  • 小(最好的选择:4 字节 INT) - 你的索引非常糟糕,每个条目最多 28 个字节
  • 唯一
  • 稳定(永远不会改变)
  • 不断增加

我敢打赌你当前的设置至少违反了两个如果不是更多这些要求。不遵循这些建议将导致空间浪费,正如您所说,大量页面和索引碎片以及页面拆分(当插入发生在聚集索引中间的某个位置时,必须“重新排列”数据)。

老实说:只需将代理 ID INT IDENTITY(1,1) 添加到您的表中,并将其设为主聚集键 - 您应该会看到性能的相当不错的提升,仅此而已,如果您有许多 INSERT(和 UPDATE)操作正在进行!

请在此处查看有关如何构成良好集群键以及它们的重要性的更多背景信息:

If you have only that single clustered index on your table, that might not be too bad. However, the clustering index is also used for looking up the real data page for any hit in a non-clustered index - therefor, the clustered index (all its columns) are also part of each and every non-clustered index you might have on your table.

So if you have a few nonclustered indices on your table, then you're definitely a) wasting a lot of space (and not just on disk - also in your server's RAM!), and b) your performance will be bad.

A good clustered index ought to be:

  • small (best bet: a 4-byte INT) - yours is pretty bad with up to 28 bytes per entry
  • unique
  • stable (never change)
  • ever-increasing

I would bet your current setup violates at least two if not more of those requirements. Not following these recommendations will lead to waste of space, and as you rightfully say, lots of page and index fragmentation and page splits (having to "rearrange" the data when an insert happens somewhere in the middle of the clustered index).

Quite honestly: just add a surrogate ID INT IDENTITY(1,1) to your table and make that the primary clustered key - you should see quite a nice boost in performance, just from that, if you have lots of INSERT (and UPDATE) operations going on!

See some more background info on what makes a good clustering key, and what is important about them, here:

情域 2024-09-17 08:40:39

我最终同意埃里克的最后一段:

“最终,这一切都取决于确切的系统及其独特的数据访问模式,以及您想要优化哪些部分的决策......”

这是我强迫人们学习的基本内容:没有通用的解决方案。

您必须了解您的数据以及对其执行的操作。您必须知道不同类型的操作的频率以及它们的影响和预期执行时间(如果最终用户同意查询执行时间并不那么重要,您不必硬调整一些很少执行的查询并影响其他所有内容 -假设每周等待几分钟等待一些报告是可以的)。当然,正如埃里克所说

“性能不仅仅与用户等待时间甚至查询响应时间有关,还与服务器资源有关”

如果此类查询影响整体服务器性能,则即使执行时间很好,也应将其视为优化的重要候选者。我见过一些非常快的查询在多处理器服务器上使用大量的 CPU,而从资源利用率的角度来看,稍微慢一点的解决方案是无与伦比的“轻量级”。在这种情况下,我几乎总是选择较慢的。

一旦您知道您的目标是什么,您就可以决定需要多少个索引以及应该对哪个索引进行聚类。唯一约束、过滤索引、包含列的索引都是非常强大的调优工具。选择正确的列很重要,但通常选择正确的列顺序更为重要。最后,如果表经常修改,不要使用大量索引来降低插入/更新性能。

I ultimately agree with Erik's last paragraph:

"Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized..."

This is the basic thing I force people to learn: there's no universal solution.

You have to know your data and the actions performed against it. You have to know how frequent different type of actions are and their impact and expected execution times (you don't have to hard tune some rarely executed query and impact everything else if the end user agrees the query execution time is not so important--let's say waiting for few minutes for some report once per week is okay). Of course, as Erik said

"performance isn't just about user wait time or even query response time but also about server resources"

If such a query affects overall server performance, it should be considered as a serious candidate for optimization, even if execution time is fine. I've seen some very fast queries that used huge amount of CPU on multiprocessor servers, while slightly slower solution were incomparable "lighter" from resource utilization point of view. In that case I almost always go for the slower one.

Once you know what is your goal you can decide how many indexes you need and which one should be clustered. Unique constraints, filtered indexes, indexes with included columns are quite powerful tools for tuning. Choosing proper columns is important, but often choosing proper order of columns is even more important. And at the end, don't kill insert/update performance with tons of indexes if the table is frequently modified.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文