What is the best way to perform bulk data operations in SQL Server?
We need to perform the following operation in our database:
There is a table A which has a column B_ID that is a foreign key to table B. Many rows in table A share the same value of B_ID, and we want to fix this by cloning the corresponding rows in B and redirecting the rows in A to them.
All of this is relatively simple, and we have already created a script that solves it by iterating over a cursor and calling a stored procedure that clones the row in table B. The problem now is that both tables A and B are huge, and there is also a huge number of groups within table A pointing to the same row in B.
What we end up with (after a couple of minutes of execution) is the transaction log filling up and the script crashing. We have even tried to divide the work into reasonably sized batches and run them one by one, but this also eventually fills up the log.
Apart from somehow cleaning up the log, is there some way to handle bulk inserts/updates of data in SQL Server that would be faster and not blow up the log at all?
6 Answers
Here's another way to do this in a batch (no cursors). @KM's answer looks like it should work, but it seems a little slow/scary to me, with lots of locking and scans involved; if you restrict the working set to only the new rows, it should be pretty fast.
Here's the setup script for the test data:
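A minimal sketch of such a setup (the Marker and Colour tables below are just illustrative stand-ins for the question's A and B tables) might look like this:

    -- Marker plays the role of table A, Colour the role of table B.
    CREATE TABLE dbo.Colour
    (
        ColourID   int IDENTITY(1, 1) PRIMARY KEY,
        ColourName varchar(50) NOT NULL
    );

    CREATE TABLE dbo.Marker
    (
        MarkerID   int IDENTITY(1, 1) PRIMARY KEY,
        MarkerName varchar(50) NOT NULL,
        ColourID   int NOT NULL REFERENCES dbo.Colour (ColourID)
    );

    -- Several markers share the same colour: this is the 1:many we want to break up.
    INSERT dbo.Colour (ColourName) VALUES ('Red'), ('Green'), ('Blue');

    INSERT dbo.Marker (MarkerName, ColourID)
    VALUES ('Marker 1', 1), ('Marker 2', 1), ('Marker 3', 1),
           ('Marker 4', 2), ('Marker 5', 2),
           ('Marker 6', 3);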
So we have a 1:Many and we want to make this a 1:1. To do this, first queue up a list of updates (we'll index this over some other set of unique columns to speed up merging later):
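A sketch of that staging step against the illustrative schema above (the temp table layout is an assumption, not the original post's code):

    -- One row per marker that needs to be repointed at a freshly cloned colour.
    -- The extra unique constraint on (OldColourID, SortKey) gives later steps an index to use.
    CREATE TABLE #NewColours
    (
        MarkerID    int NOT NULL PRIMARY KEY,
        OldColourID int NOT NULL,
        SortKey     int NOT NULL,
        NewColourID int NULL,
        UNIQUE (OldColourID, SortKey)
    );

    INSERT #NewColours (MarkerID, OldColourID, SortKey)
    SELECT MarkerID, ColourID, SortKey
    FROM
    (
        SELECT MarkerID, ColourID,
               ROW_NUMBER() OVER (PARTITION BY ColourID ORDER BY MarkerID) AS SortKey
        FROM dbo.Marker
    ) AS m
    WHERE SortKey > 1;   -- the first marker in each group keeps the existing colour row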
The result will have one row for every marker that needs to get a new colour. Then insert the new colours and capture the full output:
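One way to sketch that step: a plain INSERT ... OUTPUT cannot emit columns from the source query, so this uses the MERGE ... ON 1 = 0 trick to capture each new identity value next to the marker it was cloned for (the original answer's exact technique may have differed):

    CREATE TABLE #Inserted
    (
        MarkerID    int NOT NULL PRIMARY KEY,
        NewColourID int NOT NULL
    );

    MERGE dbo.Colour AS tgt
    USING
    (
        SELECT nc.MarkerID, c.ColourName
        FROM #NewColours AS nc
        JOIN dbo.Colour AS c ON c.ColourID = nc.OldColourID
    ) AS src
        ON 1 = 0                          -- never matches, so every source row is inserted
    WHEN NOT MATCHED THEN
        INSERT (ColourName) VALUES (src.ColourName)
    OUTPUT src.MarkerID, inserted.ColourID INTO #Inserted (MarkerID, NewColourID);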
And finally merge it (here's where that extra index on the temp table comes in handy):
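A sketch of that final step; because the MERGE above already carries MarkerID through, this version simply joins on the primary keys:

    -- Record each marker's newly cloned colour, then repoint the markers at it.
    UPDATE nc
    SET nc.NewColourID = i.NewColourID
    FROM #NewColours AS nc
    JOIN #Inserted AS i ON i.MarkerID = nc.MarkerID;

    UPDATE m
    SET m.ColourID = nc.NewColourID
    FROM dbo.Marker AS m
    JOIN #NewColours AS nc ON nc.MarkerID = m.MarkerID;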
This should be very efficient because it only ever has to query the production tables once. Everything else will be operating on the relatively small data in the temp tables.
Test the results:
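For example, against the illustrative schema above, a quick check that no colour is shared any more:

    SELECT m.MarkerID, m.MarkerName, m.ColourID, c.ColourName
    FROM dbo.Marker AS m
    JOIN dbo.Colour AS c ON c.ColourID = m.ColourID
    ORDER BY m.MarkerID;

    -- Every ColourID should now appear at most once in Marker, so this returns no rows.
    SELECT ColourID, COUNT(*) AS Markers
    FROM dbo.Marker
    GROUP BY ColourID
    HAVING COUNT(*) > 1;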
Here's our output:
This should be what you want, right? No cursors, no serious ugliness. If it chews up too much memory or tempdb space then you can replace the temp table / table variable with an indexed physical staging table. Even with several million rows, there's no way this should fill up the transaction log and crash.
If you can take the operation offline you could change the recovery model of the database, make your changes, then change the recovery model back.
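A minimal sketch, assuming a database called YourDb (the name and the choice of BULK_LOGGED versus SIMPLE are illustrative):

    ALTER DATABASE YourDb SET RECOVERY BULK_LOGGED;   -- or SIMPLE
    -- ... run the bulk fix here ...
    ALTER DATABASE YourDb SET RECOVERY FULL;
    -- If you switched to SIMPLE, take a full or differential backup afterwards
    -- to restart the log backup chain.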
Overall, though, the transaction log is there to protect you and to allow for rollback, and it will get larger as you do deletes and similar operations, because it tracks them.
NOTE: if you use this approach, be sure to have a darn good backup first...
I can't conceive of why you would want to do this. What's wrong with the current one-to-many relationship? Aren't you now going to have much larger tables to perform all your work against?
However, given that you want to do this: first, are you taking transaction log backups, and how frequently? If it is less often than every fifteen minutes, change that. When you back up the log, the log gets truncated; if you do not back it up, it grows until you run out of space. Also, the growth increment you specify for the log may be too small; increasing it might help you out as well.
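For instance (database name and path are illustrative):

    -- Backing up the log lets SQL Server reuse the space inside it.
    BACKUP LOG YourDb TO DISK = N'D:\Backups\YourDb_log.trn';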
You can try doing the work in SSIS, but I don't know whether that will really help with the logging issue. It will help improve the performance of the task, though.
I'm not sure how this will work on lots of rows, but give it a try:
OUTPUT:
If you are moving from a many-to-one relationship (many A to one B) to a one-to-one relationship (one A to one B), then it seems to me that the simplest route would be to create the fields in A to support this, and then do a simple update on A to copy the values from B into it.
This way you get rid of B altogether, and you can perform the change in one update query. Something like:
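A sketch of that idea against the question's A and B tables; the B.ID key column and the B_Value payload column are assumed names used only for illustration:

    -- Add the column(s) from B that A actually needs.
    ALTER TABLE A ADD B_Value varchar(100) NULL;
    GO

    -- Copy the values across in one set-based update.
    UPDATE a
    SET a.B_Value = b.B_Value
    FROM A AS a
    JOIN B AS b ON b.ID = a.B_ID;
    GO

    -- Once verified, the B_ID foreign key (and eventually table B itself) can be dropped.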
This is what I do:
Create a query that returns the data from the two tables (A, B) exactly as it needs to be in the final table (C), and put that into the ExtractData.sql file:
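For example (the column list and join key are illustrative, since the real schema isn't shown):

    -- ExtractData.sql: shape the combined A/B data exactly the way table C expects it.
    SET NOCOUNT ON;

    SELECT a.ID, a.SomeColumn, b.SomeOtherColumn
    FROM A AS a
    JOIN B AS b ON b.ID = a.B_ID;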
Then in a cmd window, execute this command to extract the data to a file:
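One plausible form of that command, using sqlcmd (server, database, separator, and file names are illustrative):

    sqlcmd -S MyServer -d MyDatabase -E -i ExtractData.sql -o ExtractData.dat -h -1 -s "|" -W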
To avoid filling your logs, try setting the DB recovery model to simple before the insert:
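For example (the database name is illustrative):

    ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;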
Then do a
TRUNCATE TABLE C
(if you need to clean out old data - it doesn't add to the logs the way deletes do). Then in a cmd window, execute this command to bulk load the data into table C:
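A sketch of the load command (names, separator, and batch size are illustrative; -e writes rejected rows to the error file mentioned below, and the TABLOCK hint helps the load qualify for minimal logging):

    bcp MyDatabase.dbo.C in ExtractData.dat -S MyServer -T -c -t "|" -b 10000 -e ExtractData.err -h "TABLOCK"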
Error records will show up in the ExtractData.err file, so if you need to adjust the
schema of table C you can adjust / truncate / re-load the extracted data so you don't
need to run the query every time.
Then set the recovery model back to FULL after you are done:
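Again with the illustrative database name:

    ALTER DATABASE MyDatabase SET RECOVERY FULL;
    -- Take a full backup afterwards so log backups can resume.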