非身份列上的聚集索引可以加速批量插入？

发布于 2024-09-24 00:34:51 字数 1193 浏览 13 评论 0原文

我的两个问题是：

我可以使用聚集索引来加速在大表中进行批量插入？
我还能有效地使用吗外键关系如果我的 IDENTITY 列不是聚簇列索引了？

详细地说，我有一个数据库，其中有几个非常大（100-1000 百万行之间）的表，其中包含公司数据。通常，这样的表中有 20-40 个公司的数据，每个公司都有自己的“块”，并用“CompanyIdentifier”(INT) 标记。此外，每家公司都有大约 20 个部门，每个部门都有自己的“子块”，以“DepartmentIdentifier”(INT) 标记。

经常会发生从表中添加或删除整个“块”或“子块”的情况。我的第一个想法是对这些块使用表分区，但由于我使用的是 SQL Server 2008 标准版，所以我无权使用它。尽管如此，我的大多数查询都是在“块”或“子块”上执行的，而不是在整个表上执行的。

我一直致力于优化这些表的以下功能：

在子块上运行的查询
在整个表上运行的“基准测试”查询
插入/删除大块数据。

对于1）和2）我没有遇到很多问题。我已经在关键字段上创建了几个索引（还包含有用的 CompanyIdentifier 和 DepartmentIdentifier），并且查询运行良好。

但对于3）我一直在努力寻找一个好的解决方案。我的第一个策略是始终禁用索引，批量插入大块并重建索引。一开始这个速度很快，但是现在数据库中有很多公司，每次重建索引都需要很长的时间。

目前，我的策略已更改为在插入时仅保留索引，因为现在这似乎更快。但我想进一步优化插入速度。

我似乎注意到，通过添加在 CompanyIdentifier + DepartmentIdentifier 上定义的聚集索引，将新“块”加载到表中的速度更快。在我放弃这个策略转而在 IDENTITY 列上添加聚集索引之前，有几篇文章向我指出聚集索引包含在所有其他索引中，因此聚集索引应该尽可能小。但现在我正在考虑恢复这种旧策略以加快插入速度。我的问题是，这是否明智，或者我会在其他领域遭受性能损失吗？这真的会加快我的插入速度还是这只是我的想象？

我也不确定在我的情况下是否真的需要 IDENTITY 列。我希望能够与其他表建立外键关系，但是我也可以使用类似 CompanyIdentifier+DepartmentIdentifier+[uniquifier] 方案吗？或者它必须是一个全表范围的、碎片化的 IDENTITY 编号？

非常感谢您的任何建议或解释。

原文

My two questions are:

Can I use clustered indexes to speed
up bulk inserts in big tables?
Can I then still efficiently use
foreign key relationships if my
IDENTITY column is not the clustered
index anymore?

To elaborate, I have a database with a couple of very big (between 100-1000 mln rows) tables containing company data. Typically there is data about 20-40 companies in such a table, each as their own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with their own "subchunk" marked by "DepartmentIdentifier" (INT).

It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.

I have been working to optimize these tables for the following functions:

Queries that are run on subchunks
"Benchmarking" queries that are run on the table as a whole
Inserting/removing big chunks of data.

For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.

But for 3) I have struggled to find a good solution.
My first strategy was to always disable indexes, bulk insert a big chunk and rebuild indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the index each time.

At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.

I seem to have noticed that by adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier, the loading of new "chunks" into the table is faster. Before I had abandoned this strategy in favour of adding a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question, would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts or is that just my imagination?

I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?

Thanks a lot for any suggestions or explanations.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

以歌曲疗慰 2024-10-01 00:34:51

好吧，我已经对其进行了测试，并且在两个“块定义”列上放置聚集索引可以提高表的性能。

与我有聚集 IDENTITY 键的情况相比，现在插入块的速度相对较快，并且与没有任何聚集索引时的速度差不多。删除块比使用或不使用聚集索引要快。

我认为我想要删除或插入的所有记录都保证集中在硬盘的某个部分，这使得表速度更快 - 对我来说这似乎是合乎逻辑的。

更新：经过一年的设计经验，我可以说，要使这种方法发挥作用，有必要安排定期重建所有索引（我们每周一次）。否则，索引很快就会变得碎片化并且性能下降。尽管如此，我们正在迁移到带有分区表的新数据库设计，它基本上在各个方面都更好 - 除了企业服务器许可证成本，但我们现在已经忘记了它。至少我有。

回复收藏 0 原文

信愁 2024-10-01 00:34:51

聚集索引是一种物理索引、一种物理数据结构、一种行顺序。如果在聚集索引的中间插入，数据将物理插入到当前数据的中间。我想在这种情况下会出现严重的性能问题。我只从理论上知道这一点，因为如果我在实践中这样做，根据我的理论知识，这将是一个错误。

因此，我只在始终物理地插入到末尾的字段上使用（并建议使用）聚集索引，以保留顺序。

聚集索引可以放置在日期时间字段上，该字段标记插入时刻或类似的内容，因为从物理上来说，它们将在附加行后进行排序。 Identity 也是一个很好的聚集索引，但并不总是与查询相关。

在您的解决方案中，您放置了一个 [uniquifier] 字段，但是当您可以放置一个可以做到这一点的身份时，为什么要这样做呢？它将是唯一的、物理有序的、小（对于其他表中的外键意味着更小的索引），并且在某些情况下更快。

你就不能尝试一下吗？我这里也有类似的情况，我有 40 亿行，不断有更多行插入（每秒最多 100 行），表没有主键，也没有聚集索引，所以这个主题中的命题对我来说也非常有趣。

回复收藏 0 原文

臻嫒无言 2024-10-01 00:34:51

我可以使用聚集索引来加速大表中的批量插入吗？

永远不会！想象一下，您需要将另外一百万行放入该表中并对其进行物理排序，从长远来看，这将是性能的巨大损失。

如果我的 IDENTITY 列不再是聚集索引，我还能有效地使用外键关系吗？

绝对地。顺便说一句，聚集索引并不是灵丹妙药，而且可能比普通索引慢。

回复收藏 0 原文

窝囊感情。 2024-10-01 00:34:51

查看 System.Data.SqlClient.SqlBulkCopy API。考虑到您需要在数据库中写入大量行，这可能就是您所需要的？

大容量复制通过单个操作将数据流式传输到表中，然后执行一次索引检查。我使用它在数据库表中复制 500,000 行，并且它的性能比我尝试过的任何其他技术都要好一个数量级，假设您的应用程序可以构建为使用 API？

回复收藏 0 原文

奈何桥上唱咆哮 2024-10-01 00:34:51

我最近一直在玩一些 etl 东西。我定期插入表，然后在插入之前和之后删除和读取索引，尝试合并语句，然后我终于尝试了 ssis。我被ssis卖了。就在昨天，我设法将每次运行的 etl 过程（约 2400 万条记录，约 6GB）从约 1-1 1/2 小时缩短到约 24 分钟，只是让 ssis 处理插入。

我相信有了高级服务，您应该能够使用 ssis。

回复收藏 0 原文

め可乐爱微笑 2024-10-01 00:34:51

（鉴于您已经选择了答案并给了自己积分，这是作为免费服务提供的，一种慈善行为！）

一知半解是一件危险的事情。有很多问题需要考虑；并且必须将它们一起考虑。处理任何一个问题并单独检查它是一种非常分散的管理数据库的方法：您将永远发现一些新的事实并改变您之前想到的一切。在开始之前，请阅读此▶问题/答案◀ 的上下文。

不要忘记，如今任何拥有键盘和调制解调器的人都可以发表他们的“论文”。他们中的一些人为 MS 工作，传播最新的“增强功能”；其他人发布了他们从未在某种情况下使用过或只使用过一次的功能的热情报告，但他们发布了该功能在每种情况下都有效。（看看斯彭斯的回答：他很热情，“被卖”，但仔细一看，这些说法都是假的；他不是坏人，只是微软世界中大众的典型，以及他们如何运作；他们如何发布。）

注意：我使用“MicroSofties”一词来描述那些相信任何不合格的人都可以管理数据库的盖茨概念的人； MS 会解决一切问题。它并不是一种侮辱，更多的是一种爱意，因为对魔法的信仰和物理定律的悬置。

聚集索引

是为关系数据库设计的，由真实的工程师（Sybase，在 MS 获得代码之前）的大脑比所有 MS 的大脑加起来还多。关系数据库有关系键，而不是Id物联网键。这些是多列键，可以自动分配数据，从而自动分配插入负载，例如。始终插入各个公司的发票（尽管不在我们讨论的“块”的情况下）。

如果您有良好的关系键，CI 会提供范围查询（您的 (1) 和 (2) ）以及 NCI 根本不具备的其他优势。
在建模和标准化数据之前从 Id 列开始，会严重阻碍建模和标准化过程。
如果您有一个Id物联网数据库，那么您将拥有更多的索引。许多 MS 数据库的内容不是“关系型”的，它们通常只是非标准化的文件系统，其索引比标准化数据库多得多。因此，有一个很大的推动力，大量的MS“增强”来尝试给这些堕胎带来一点速度。修复症状，但不要接近导致症状的问题。
在 SQL 2005 和 2008 年，MS 都在 CI 上搞砸了，结果是它们现在在某些方面更好，但在其他方面更糟； CI 的通用性已经丧失。
NCI 携带 CI 是不正确的（CI 是基本的单一存储结构；NCI 是次要的，并且依赖于 CI；这就是为什么当您重新创建 CI 时，所有 NCI 都是自动重新创建）。 NCI 在叶级别携带 CI Key。
微软有自己的问题，这些问题随着主要版本的发布而改变（但并未消除）：
- 在 MS 中，这并不能有效地完成，因此 NCI 索引大小很大；在企业 DBMS 中，如果可以有效地完成此操作，则无需考虑这一点。
- 因此，在 MS 世界中，CI 密钥应尽可能短这一说法只对了一半。如果您了解考虑因素是 NCI 的大小，并且如果您愿意承担这笔费用，并且由于精心构建的 CI 而返回一个非常快的表，那么这是最好的选择。
  < /里>
- CI 应该是Idiot 列的常见建议是完全错误的。 CI 键最差的候选者是单调递增的值（IDENTITY、DATETIME 等）。为什么？因为您已经保证所有并发插入都会争夺当前插入位置，即索引上的最后一页。
- 分区（MS 在企业供应商之后 10 年提供）的真正目的是分散此负载。当然，他们必须提供一种分配分区的方法，猜猜看，除了关系键之外什么都没有；但首先，现在 Idiot 密钥分布在 32 或 64 个分区上，提供更好的并发性。
CI 必须是唯一的。关系数据库需要唯一的键，所以这是理所当然的。
- 但是对于向数据库中倒入非关系型内容的业余爱好者来说，如果他们不知道这个规则，但是他们知道CI传播数据（一知半解是一件危险的事情），他们就会保留他们在 NCI 中拥有 ID 物联网密钥（很好），但他们在几乎但不完全是唯一密钥上创建 CI。致命。 CI必须是唯一的，这是设计要求。重复（记住我们在这里谈论 CI Key）行位于页外，位于溢出页面和（当时的）最后一页；并构成了页面链严重碎片化的方法。
- 更新，因为这一点在其他地方受到质疑。我已经说过 MS 不断改变方法而没有解决问题。
  - MS Online 手册< /a>，用他们漂亮的图片（不是技术图表）告诉我们，在 2008 年，他们用可爱的“Uniqueifier”取代了（用一个替换另一个）溢出页面。
因此，从一开始（1984 年）就没有改变，CI 的最佳候选是多列唯一关系键（我不能说你的键是肯定的，但它看起来确实是这样）。
并将任何单调递增的键（IDENTITY、DATETIME）放入 NCI 中。
还请记住，CI 是单个存储结构，它消除了（否则）堆； CI B-Tree 与叶级别的行结合；叶级条目是行。这保证了每次访问时减少一次读取。
- 因此 NCI+Heap 不可能比 CI 更快。 MS 世界中另一个违背物理定律的常见神话：导航 B 树并将其写入您已经所在的位置，必须比另外将行写入单独的位置更快存储结构。但微软确实相信魔法，他们已经暂停了物理定律。
  .
还有许多其他功能需要学习和使用，我至少会提到 FILLFACTOR 和 RESERVEPAGEGAP，以使这篇文章具有一定的完整性。在了解这些功能之前，请勿使用它们。所有性能特性都有一个您需要理解和接受的成本。
CI 也在页面和范围级别进行自我调整，不会浪费空间。 PageSplits 是需要监视的内容（仅限随机插入），并且可以通过 FILLFACTOR 和 RESERVEPAGEGAP 轻松调节。
并阅读 SO 站点的聚集索引，但请记住以上所有内容，尤其是。前两段。

您的具体案例

无论如何，摆脱您的代理键（Idiot 列），并将其替换为真正的自然关系键。代理始终是附加键和索引；这是一个不应该被遗忘或掉以轻心的代价。
CompanyIdentifier+DepartmentIdentifier+[uniquiefier] 正是我所说的。现在请注意，它们已经是 INT，并且速度非常快，因此添加 NUMERIC(10,0) Idiot Key 是非常愚蠢的。使用 1 或 2 字节列来强制唯一性。
如果您正确理解了这一点，您可能不需要分区许可证。
在您定期执行批量删除/插入的情况下，CompanyIdentifier+DepartmentIdentifier+[uniquifier] 是 CI 的完美候选者（除了您发布的数据之外，不知道有关您的数据库的任何信息）。详细内容见上文。
- 与其他人所说的相反，这是一件好事，并且不会破坏 CI。假设您有 20 个公司，您删除了 1 个公司，这相当于数据的 5%。整个 PageChain 原本相当连续，现在被降级到 FreePageChain，连续且完整。准确地说，您有一个单点碎片，但不是正常使用该词意义上的碎片。猜猜看，如果您转身执行批量插入，您认为数据会去哪里？这与已删除行的物理位置完全相同。 FreePageChain 一次移动到 PageChain、范围和页面。
  .
但令人担忧的是，您不知道 CI 的需求是独特的。遗憾的是微软写的都是垃圾，但不是每个简单的规则是基于什么/为什么/什么；不是核心信息。非唯一 CI 的确切症状是，在 DROP/CREATE CI 后，表会立即变得非常快，然后随着时间的推移而变慢。一个好的 Unique CI 将保持其速度，并且需要一年的时间才能放慢速度（在我的大型活跃银行数据库上需要 2 年）。
对于 10 亿行来说，4 小时是一个非常长的时间（我可以在企业平台上 3 分钟内使用 6 列键在 160 亿行上重新创建 CI）。但无论如何，这意味着您必须将其安排为每周定期或要求维护。
为什么不使用WITH SORTED_DATA选项？你的数据在下降之前没有排序吗？此选项重写 CI 非叶页，但不重写叶页（包含行）。只有当它确信数据已排序时才能这样做。不使用此选项会按物理顺序重写每个页面。

现在，请友善一点。在你问我二十个问题之前，请先阅读一下并理解我在这里定义的所有问题。

(Given you have already chosen the Answer and given yourself the points, this is provided as a free service, a charitable act !)

A little knowledge is a dangerous thing. There are many issues to be considered; and they must be considered together. Taking any one issue and examining it in isolation is a very fragmented way to go about administering a database: you will forever be finding some new truth and changing eveything you thought before. Before launching into it, please read this ▶question/answer◀ for context.

Do not forget, these days anyone with a keyboard and a modem can get their "papers" published. Some of them work for MS, evangelising the latest "enhancement"; others publish glowing reports of features they have never used, or used only once, in one context, but they publish that it works in every context. (Look at Spence's answer: he is enthusiastic and "sold" but under scrutiny, the statements are false; he is not a bad person, just typical of the masses in the MS world and how they operate; how they publish.)

Note: I use the term MicroSofties to describe those people who believe in the gatesian notion that any unqualified person can administer a database; and that MS will fix everything. It is not intended as an insult, more as an endearment, because of the belief in magic, and the suspension of the laws of physics.

Clustered Indices

Were designed for Relational databases, by real engineers (Sybase, before MS acquired the code) who have more brains than all of MS put together. Relational databases have Relational Keys, not Idiot keys. These are multi-column keys, that automatically distribute the data, and therefore the insert load, eg. inserting Invoices for various Companies all the time (although not in our discussed case of "chunks").

if you have good Relational keys, CIs provide Range Queries (your (1) & (2) ), and other advantages, that NCIs simply do not have.
Starting off with Id columns, before modelling and normalising the data, severely hinders the modelling and normalisation processes.
If you have an Idiot database, then you will have more indices than not. The contents of many MS databases are not "relational", they are commonly just unnormalised filing systems, with way more indices than a Normalised database would have. Therefore there is a big push, a lot of MS "enhancements" to try and give these abortions a bit of speed. Fix the symptom but don't go anywhere near the problem that caused the symptom.
In SQL 2005 and again in 2008 MS has screwed around with CIs, and the result is they are now better in some ways, but worse in other ways; the universality of CIs has been lost.
It is not correct that NCIs carry the CI (the CI is the basic single storage structure; the NCIs are secondary, and dependent on the CI; that's why when you re-create a CI, all the NCIs are automatically re-created). The NCIs carry the CI Key at the leaf level.
Microsoft has its problems, which change with the major releases (but are not eliminated):
- and in MS this is not efficiently done, so the NCI index size is large; in enterprise DBMS when this is efficiently done, this is not a consideration.
- In the MS world, therefore, it is only half true, that the CI key should be as short as possible. If you understand that the consideration is the size of NCIs, and if you are willing to incur that expense, it return for a table that is very fast due to a carefully constructed CI, then that is the best option.
- The common advice that the CI should be theIdiot column is totally and completely wrong. The worst canditate fo a CI key is a monotonically increasing value (IDENTITY, DATETIME, etc). WHy ? because you have guaranteed that all concurrent inserts will fight for the current insert location, the last page on the index.
- The real purpose of Partitioning (Which MS provided 10 years after the Enterprise vendors) is to spread this load. Sure, they then have to provide a method of allocating the Partitions, on guess what, nothing but a Relational Key; but to start with, now the Idiot key is spread across 32 or 64 Partitions, providing better concurrency.
the CI must be Unique. Relational dbs demand Unique keys, so that is a no-brainer.
- But for the amateurs who have poured non-relational contents into the database, if they do not know this rule, but they know that the CI spreads the data (a little knowledge is a dangerous thing), they keep their Idiot key in a NCI (good) but they create the CI on an almost-but-not-quite Unique Key. Deadly. CI's must be Unique, that is a design demand. Duplicate (remember we are talking CI Key here) rows are off-page, located in Overflow pages, and the (then) last page; and constitute a method of badly fragmenting the Page Chain.
- Update, since this point is being questioned elsewhere. I have already stated the MS keeps changing the methods without fixing the problem.
  - The MS Online manual, with their pretty pictures (not technical diagrams) tells us that In 2008, they have replaced (substitued one for another) Overflow Pages, with the adorable "Uniqueifier".
  - That totally satisfies the MicroSofties. Non-Unique CIs are not a problem. It is handled by magic. Case closed.
  - But there is no logic or completeness to the statements, and qualified people will ask the obvious questions: where is this "Uniqueifier" located ? On every row, or just the rows needing "Uniqueifying". DBBC PAGE shows it is on every row. So MS has just added a 4-byte secret column (including handling overhead) to every row, instead of a few Overflow Pages for the non-unique rows only. That's MS idea of engineering.
  - End Update
- Anyway, the point remains, that Non-Unique CIs have a substantial overhead (now more than before) and should be avoided. you would be better off adding a 1- or 2-byte column yourself, to force uniqueness.
  .
Therefore, unchanged from the beginning (1984), the best candidate for a CI is a multi-column unique Relational key (I cannot say that yours is for sure, but it certainly looks like it).
And put any monotonically increasing keys (IDENTITY, DATETIME) in an NCI.
Remember also that the CI is a single storage structure, which eliminates the (otherwise) Heap; the CI B-Tree is married to the rows at the Leaf level; the Leaf Level entry is the row. That guarantees one less read on every access.
- So it is not possible, that a NCI+Heap can be faster than a CI. Anther common myth in the MS world that defies the laws of physics: navigating a B-Tree and writing to the one place you are already in, has got to be faster than additionally writing the row to a separate storage structure. But MicroSofties do believe in magic, they've suspended the laws of physics.
  .
There are many other features you need to learn and use, I will mention at least FILLFACTOR and RESERVEPAGEGAP, to give this post a bit of completeness. Do not use these features until you understand them. All performance features have a cost that you need to understand and accept.
CIs are also self-trimming at both the Page and Extent level, there is no wasted space. PageSplits are something to monitor for (Random inserts only), and that is easily modulated by FILLFACTOR and RESERVEPAGEGAP.
And read the SO site for Clustered Indices, but keep in mind all the above, esp. the first two paras.

Your Specific Case

By all means, get rid of your surrogate keys (Idiot columns), and replace them with true natural Relational keys. Surrogates are always an additional key and index; that is a price that should not be forgotten or taken lightly.
CompanyIdentifier+DepartmentIdentifier+[uniquiefier] is exactly what I am talking about. Now notice that they are already INTs, and very fast, so it is very silly to add a NUMERIC(10,0) Idiot Key. Use a 1- or 2-byte column toforce Uniqueness.
If you get this right, you may not need a Partition licence.
The CompanyIdentifier+DepartmentIdentifier+[uniquifier] is the perfect candidate (not knowing anything about your db other than that which you have posted) for a CI, in the context that you perform mass delete/insert periodically. Detailed above.
- Contrary to what others have stated, this is a good thing, and does not fragment the CI. Lets' say ou have 20 Companies, and you delete 1, which constitutes 5% of the data. That entire PageChain which was reasonably contiguous, is now relegated to the FreePageChain, contiguous and intact. To be precise, you have a single point of fragmentation, but not fragmentation in the sense of the normal use of the word. And guess what, if you turn around and perform a mass insert, where do you think that data will go ? That's right the exact same physical location as the Deleted rows. And the FreePageChain moves to the PageChain, extent and page at a time.
  .
but what is alarming is that you did not know about the demand for CI to be Unique. Sad that the MicroSofties write rubbish, but not why/what each simplistic rule is based on; not the core information. The exact symptom of non-unique CIs is, the table will be very fast immediately after DROP/CREATE CI, and then slow down over time. An good Unique CI will hold its speed, and it would take a year to slow down (2 years on my large, active banking dbs).
4 hours is a very long time for 1 Billion rows (I can recreate a CI on 16 billion rows with a 6-column key in 3 minutes on an enterprise platform). But in any case, that means you have to schedule it as regular weekly or demand maintenance.
why aren't you using the WITH SORTED_DATA option ? Wasn't your data sorted, before the drop ? This option rewrites the CI Non-leaf pages but not the leaf pages (containing the rows). It can only do that if it is confident that the data was sorted. Not using this option rewrites every page, in physical order.