重新索引大表 - 我有多糟糕?
我有一个 1 TB、600m 行的表,该表对索引列的选择存在误导,特别是主键列上的聚集索引,该索引从未在选择查询中使用。
我想从该行中删除聚集索引并在许多其他行上创建它。
目前的表如下:
colA (PK, nvarchar(3)) [聚集索引 pt b]
colB (PK, bigint) [聚集索引 pt a]
colC (DateTime) [非聚集索引]
colD (Money) [非聚集索引]
colE (位) [无索引]
colF(位)[无索引]
colG(int)[无索引]
更多非索引列
我想将其更改为如下所示:
colA (PK, nvarchar(3)) [聚集索引 pt a]
colB (PK, bigint) [非聚集索引]
colC (DateTime) [非聚集索引]
colD (Money) [聚集索引 pt d]
colE (位) [聚集索引 pt b]
colF (位) [聚集索引 pt c]
colG (int) [聚集索引 pt e]
更多非索引列
两个问题: 1)您估计此更改将需要多长时间(消息末尾的服务器规范)。 不幸的是,它是一个实时数据库,如果不知道它会停机多长时间,我就无法停机。
2)向聚集索引添加这么多列是不是一个糟糕的主意? 几乎从不执行更新。 有许多插入和许多选择总是使用所有建议的索引行作为选择参数。
服务器规格:RAID 5 中的 5 个 15kRPM 驱动器、MS-SQL Sever 2005 以及一些保持其运行的位。
I have a 1 TB, 600m row, table which has a misguided choice of indexed columns, specifically a clustered index on the primary key column which is never used in a select query.
I want to remove the clustered index from this row and create it on a number of other rows.
Table is currently like this:
colA (PK, nvarchar(3)) [clustered index pt b]
colB (PK, bigint) [clustered index pt a]
colC (DateTime) [non-clustered index]
colD (Money) [non-clustered index]
colE (bit) [no index]
colF (bit) [no index]
colG (int) [no index]
more non-indexed columns
I would like to change it to look like this:
colA (PK, nvarchar(3)) [clustered index pt a]
colB (PK, bigint) [non-clustered index]
colC (DateTime) [non-clustered index]
colD (Money) [clustered index pt d]
colE (bit) [clustered index pt b]
colF (bit) [clustered index pt c]
colG (int) [clustered index pt e]
more non-indexed columns
Two questions:
1) How long would you guesstimate that this change will take (server spec at end of message). Unfortunately it is a live DB and I can't have downtime without some idea of how long it will be down for.
2) Is it a terrible idea to add so many columns to a clustered index? Updates are nearly never performed. There are many inserts and many selects which always use all of the proposed indexed rows as select parameters.
Server spec: 5 x 15kRPM drives in RAID 5, MS-SQL Sever 2005 and some bits to keep them running.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
一方面,我会避免使聚集索引比它绝对需要的更宽。 把它分成五个部分似乎会适得其反。 该复合聚集索引中的所有列是否稳定,例如永远不会改变?
如果没有,我会不惜一切代价避免他们。 聚集索引应该是:
您可以更改非聚集索引 - 没问题。 但要避免让聚集索引变得混乱! 这肯定会降低你的表现!
查看 Kimberly Tripp 关于索引的优秀博客文章:
Marc
For one thing, I would AVOID making the clustered index wider than it absolutely has to be. Making it into five parts seems about contra-productive. Are ALL the columns in this compound clustered index stable, e.g. never change??
If not, I would avoid them at all costs. A clustered index should be:
You can change your non-clustered indices - no problem. But avoid making the clustered index messy! That'll definitely bring down your performance!
Check out Kimberly Tripp's excellent blog articles on indexing:
Marc
我进行了更改,并且没有花太长时间。
以下是每个操作的时间,第一次是在具有单个 7200RPM 驱动器的备份服务器上运行,第二次是在 RAID 中具有 15k 驱动器的主服务器上运行。
2:39 小时/19 分钟
15:30 小时/2 小时
4 小时/1 小时
现在最常用的选择查询需要 << 以前通常需要 10 到 15 分钟,现在只需 10 秒。 不错的改进! 插入时间似乎也快了一点。
I made the changes and it didn't take too long.
Here are the times for each operation, first time is when run on a backup server with a single 7200RPM drive, and the second on the main server with 15k drives in RAID.
2:39 hrs / 19 minutes
15:30 hrs / 2 hrs
4 hrs / 1 hr
The select query most often used now takes < 10 seconds where it often took 10 to 15 minutes before. Nice improvement! Insert times seem a bit faster too.
您应该有一个具有类似规格的开发环境,您可以使用它来尝试使用实时数据库的副本。
You should have a development environment with similar specs that you can use to try this with a copy of the live database.
虽然更改聚集索引听起来肯定会有所帮助,但为什么不首先尝试添加(非聚集)覆盖索引呢?
在构建新索引时不应删除该表,并且应向您指示将导致此重组的性能改进(如果有)。
While changing the clustered index sounds like it would certainly help here, why don't you try adding a (nonclustered) covering index first?
Shouldn't take the table down while the new index is built, and should give you an indication of what performance improvement (if any) will result in this reorganization.
您可能不需要担心停机时间,因为它可能是 可以实时进行更改(没有任何停机时间)。 适用于SQL Server 2005企业版。
You may not need to worry about the downtime, as it may be possible to do the change live (without any downtime). Applies to SQL Server 2005 Enterprise edition.
如果您有磁盘空间,您可以做的一件事是使用正确的聚集索引创建第二个表,通过增量过程在几天内将行复制到新表。 一旦所有行都存在,就在两个表上执行 sp_rename (这将只需要几分钟的停机时间。如果您的应用程序引用视图而不是物理表,您可以在应用程序零停机时间的情况下完成此操作。我希望这会有所帮助。
[编辑]
您还必须处理行的更新,需要有时间戳或源表上可用的上次更新字段,以便在复制所有行后可以同步更新。
One thing you could do if you have the disk space is create a second table with the correct clustered index copy the rows over over to the new table over several days via an incremental process. Once all the rows are there execute sp_rename on both tables (this would require just a few minutes of downtime. If your apps were referencing a view instead of the physical table you could have this done with zero downtime to your apps. I hope this helps.
[Edit]
You'll also have to deal with the update to the rows, you need to have a timestamp, or last updated field available on the source table so that you can sync the updates once you have all the rows copied over.
我同意布莱恩的观点,你应该有一个具有相同数据量的测试数据库并运行索引更改。 但是,我认为您进行此更改是因为您认为这会加快查询速度。 您应该运行基准测试(在索引更改之前和之后)并确保您的优化不会变得悲观。
I agree with Brian, you should have a test database with same amount of data and run the index change. But, I presume that you are making this change because you think it will speed up the queries. You should run benchmarks test (before and after the index change) and ensure that your optimization doesn't become a pessimization.
1) 您预计此更改需要多长时间(消息末尾的服务器规范)。 不幸的是,它是一个实时数据库,如果不知道它会停机多长时间,我就无法停机。
这真的,真的取决于数据。 仅表参数并不能提供足够的信息。 可能是几分钟(不太可能)到几天(不太可能),最可能的时间介于两者之间。
2)向聚集索引添加如此多的列是不是一个糟糕的主意? 几乎从不执行更新。 有许多插入和许多选择总是使用所有建议的索引行作为选择参数。
不,这不应该造成任何问题。 仅当您进行少量更新时,性能才会提高。 但是,当发生这些更新时,修复索引需要一段时间,并且在此期间性能会受到影响,具体情况取决于数据。
1) How long would you guesstimate that this change will take (server spec at end of message). Unfortunately it is a live DB and I can't have downtime without some idea of how long it will be down for.
It really, really depends on the data. Just the table parameters alone doesn't provide enough information. Could be a few minutes (unlikely) to a few days (unlikely) with the likeliest time being somewhere inbetween.
2) Is it a terrible idea to add so many columns to a clustered index? Updates are nearly never performed. There are many inserts and many selects which always use all of the proposed indexed rows as select parameters.
No, that should not pose any problems. Performance should only improve if you are making few updates. When those updates occur, it'll take awhile to fix the index, though, and performance will suffer during that time, which will vary depending on the data.