在临时表上创建主键 - 何时?
我有一个正在处理大量数据的存储过程。 我已将该数据插入到临时表中。 事件的整体流程类似于
CREATE #TempTable (
Col1 NUMERIC(18,0) NOT NULL, --This will not be an identity column.
,Col2 INT NOT NULL,
,Col3 BIGINT,
,Col4 VARCHAR(25) NOT NULL,
--Etc...
--
--Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
我的问题是什么时候是在我的#TempTable表上创建主键的最佳时间?我推测我应该在插入所有内容后创建主键约束/索引数据,因为在创建主键信息时需要重新组织索引。 但我意识到我的强调假设可能是错误的......
如果它是相关的,我使用的数据类型是真实的。 在 #TempTable
表中,Col1
和 Col4
将构成我的主键。
更新:在我的例子中,我复制了源表的主键。 我知道构成我的主键的字段将始终是唯一的。 如果我在最后添加主键,我不担心更改表失败。
尽管如此,我的问题仍然是假设两者都会成功,哪个更快?
I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like
CREATE #TempTable (
Col1 NUMERIC(18,0) NOT NULL, --This will not be an identity column.
,Col2 INT NOT NULL,
,Col3 BIGINT,
,Col4 VARCHAR(25) NOT NULL,
--Etc...
--
--Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data because the index needs to be reorganized as the primary key info is being created. But I realized that my underlining assumption might be wrong...
In case it is relevant, the data types I used are real. In the #TempTable
table, Col1
and Col4
will be making up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands as which is faster assuming both would succeed?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
这取决于很多。
如果在加载后对主键索引进行聚簇,则整个表将被重写,因为聚簇索引并不是真正的索引,而是数据的逻辑顺序。 插入的执行计划将取决于计划确定时到位的索引,如果聚集索引到位,它将在插入之前排序。 您通常会在执行计划中看到这一点。
如果将主键设置为简单约束,则它将是常规(非聚集)索引,并且表将按照优化器确定的任何顺序进行填充并更新索引。
我认为总体最快的性能(加载临时表的过程)通常是将数据写入堆,然后应用(非聚集)索引。
然而,正如其他人指出的那样,索引的创建可能会失败。 此外,临时表并不是孤立存在的。 大概有一个最佳索引可以从中读取下一步的数据。 该索引需要就位或创建。 这里是您必须在可靠性(首先应用 PK 和任何其他约束)和稍后的速度(如果您打算有一个聚簇索引)之间进行速度权衡的地方)。
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it for the next step. This index will need to either be in place or created. This is where you have to make a tradeoff of speed here for reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place if you are going to have one).
如果数据库的恢复模式设置为简单或批量记录,则 SELECT ... INTO ... UNION ALL 可能是最快的解决方案。 SELECT .. INTO 是批量操作,批量操作的日志记录最少。
例如:
否则,Cade Roux 在之前或之后提出了很好的建议。
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT .. INTO is a bulk operation and bulk operations are minimally logged.
eg:
Otherwise, Cade Roux had good advice re: before or after.
您也可以在插入之前创建主键 - 如果主键位于标识列上,那么无论如何插入都会按顺序完成,并且不会有任何区别。
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
比性能考虑更重要的是,如果您不能绝对、100% 确定会将唯一值插入到表中,请首先创建主键。 否则主键创建失败。
这可以防止您插入重复/错误的数据。
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
如果您在创建表时添加主键,则第一次插入将是免费的(不需要检查)。第二次插入只需查看它是否与第一次不同。 第三次插入必须检查两行,依此类推。 检查将是索引查找,因为存在唯一约束。
如果在所有插入之后添加主键,则每行都必须与其他行匹配。 所以我的猜测是,尽早添加主键会更便宜。
但也许 Sql Server 有一种非常聪明的方法来检查唯一性。 因此,如果您想确定的话,请测量一下!
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
我想知道是否可以改进一个非常非常“昂贵”的存储过程,该存储过程需要在每次插入表时进行一系列检查,并找到了这个答案。 在存储过程中,打开了多个临时表并相互引用。 我将主键添加到 CREATE TABLE 语句中(即使我的选择使用 WHERE NOT EXISTS 语句来插入数据并确保唯一性),并且我的执行时间被严重缩短。 我强烈建议使用主键。 即使您认为不需要它,也至少要尝试一下。
I was wondering if I could improve a very very "expensive" stored procedure entailing a bunch of checks at each insert across tables and came across this answer. In the Sproc, several temp tables are opened and reference each other. I added the Primary Key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS statements to insert data and ensure uniqueness) and my execution time was cut down SEVERELY. I highly recommend using the primary keys. Always at least try it out even when you think you don't need it.
我认为这对您的情况没有任何重大影响:
当您如果 PK 值不是系统创建的,则在插入开始之前预先创建它,您可能会在插入数据时捕获 PK 违规。
但除此之外——真的没有太大区别。
马克
I don't think it makes any significant difference in your case:
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
我本来不打算回答这个问题,因为我对自己的了解不是 100% 有信心。 但因为它看起来并没有得到太多的回应...
我的理解是 PK 是一个唯一的索引,当您插入每条记录时,您的索引会被更新和优化。 所以...如果先添加数据,然后创建索引,索引只优化一次。
因此,如果您确信您的数据是干净的(没有重复的 PK 数据),那么我会说插入,然后添加 PK。
但是如果你的数据可能有重复的PK数据,我会说先创建PK,这样它就会尽快被炸掉。
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
当您在表创建时添加 PK 时 - 插入检查为
O(Tn)
(其中Tn
是“第 n 个三角数”,即1 + 2 + 3 ... + n
) 因为当您插入第 x 行时,会根据之前插入的“x - 1”行进行检查当您在插入所有值之后添加 PK 时 -检查器的复杂度为
O(n^2)
,因为当您插入第 x 行时,系统会针对所有n
现有行进行检查。第一个显然更快,因为
O(Tn)
小于O(n^2)
PS 示例:如果插入 5 行,则为
1 + 2 + 3 + 4 + 5 = 15
次操作与5^2 = 25
次操作When you add PK on table creation - the insert check is
O(Tn)
(whereTn
is "n-th triangular number", which is1 + 2 + 3 ... + n
) because when you insert x-th row, it's checked against previously inserted "x - 1" rowsWhen you add PK after inserting all the values - the checker is
O(n^2)
because when you insert x-th row, it's checked against alln
existing rows.First one is obviously faster since
O(Tn)
is less thanO(n^2)
P.S. Example: if you insert 5 rows it is
1 + 2 + 3 + 4 + 5 = 15
operations vs5^2 = 25
operations