Copying one column to another in a SQL Server table with over a billion rows

Posted 2024-09-24 16:34:17

Database : SQL Server 2005

Problem : Copy values from one column to another column in the same table with a billion+
rows.

test_table (int id, bigint bigid)

Attempt 1: a plain UPDATE query

update test_table set bigid = id 

This fills up the transaction log and rolls back due to lack of transaction log space.

Attempt 2: a procedure along the following lines

set nocount on
set rowcount 500000   -- limit each update to 500,000 rows

declare @rowcount int, @rowsupdated bigint
set @rowcount = 1
set @rowsupdated = 0

while @rowcount > 0
begin
 update test_table set bigid = id where bigid is null
 set @rowcount = @@rowcount
 set @rowsupdated = @rowsupdated + @rowcount
end

print @rowsupdated

The above procedure starts slowing down as it proceeds.

Attempt 3: creating a cursor for the update.

Cursors are generally discouraged in the SQL Server documentation, and this approach updates one row at a time, which is far too time-consuming.

Is there an approach that can speed up copying values from one column to another? Basically, I am looking for some 'magic' keyword or logic that will let the update query rip through the billion rows, half a million at a time, sequentially.

Any hints or pointers will be much appreciated.

Comments (7)

箜明 2024-10-01 16:34:17

I'm going to guess that you are closing in on the 2.1 billion limit of an INT datatype on an artificial key for a column. Yes, that's a pain. It's much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)

Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.

Log Growth

The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.

If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
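
As a rough illustration (the database name and backup file path here are placeholders, not something from the original post), a log backup run between batches would look something like:

BACKUP LOG test_db
TO DISK = 'D:\Backups\test_db_log.trn'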

Indexes and Speed

ALL of the "where bigid is null" answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, of course, just add an index on BIGID, but I'm not convinced that is the right answer.
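
(For reference, that index would be something along the lines of the statement below; the index name is made up, and note that every batch update would then also have to maintain it.)

CREATE INDEX IX_test_table_bigid ON test_table (bigid)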

The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact and do a variation of Jess' idea:

declare @counter int
set @counter = 1
while @counter < 2000000000 --or whatever
begin
  update test_table set bigid = id
  where id between @counter and (@counter + 499999) --BETWEEN is inclusive
  set @counter = @counter + 500000
end

This should be extremely fast, because of the existing indexes on ID.

The IS NULL check really wasn't necessary anyway, and neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.

余生共白头 2024-10-01 16:34:17

Use TOP in the UPDATE statement:

UPDATE TOP (@row_limit) dbo.test_table
   SET bigid = id 
 WHERE bigid IS NULL
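
To work through the whole table this way, the statement just gets repeated until it finds nothing left to update. A minimal sketch, assuming a batch size of 500,000:

DECLARE @row_limit int
SET @row_limit = 500000

WHILE 1 = 1
BEGIN
    UPDATE TOP (@row_limit) dbo.test_table
       SET bigid = id
     WHERE bigid IS NULL

    IF @@ROWCOUNT = 0 BREAK   -- nothing left with a NULL bigid
END
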
一个人的夜不怕黑 2024-10-01 16:34:17

You could try to use something like SET ROWCOUNT and do batch updates:

SET ROWCOUNT 5000;

UPDATE dbo.test_table 
SET bigid = id 
WHERE bigid IS NULL
GO

and then repeat this as many times as you need to.

This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.

Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
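
If you would rather not re-run the batch by hand, the same idea can be wrapped in a loop that stops once a batch updates nothing (a sketch, not part of the original answer):

SET ROWCOUNT 5000

WHILE 1 = 1
BEGIN
    UPDATE dbo.test_table
    SET bigid = id
    WHERE bigid IS NULL

    IF @@ROWCOUNT = 0 BREAK   -- no more NULL bigid rows
END

SET ROWCOUNT 0   -- reset so later statements are not limited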

倾城月光淡如水﹏ 2024-10-01 16:34:17

Is this a one-time thing? If so, just do it by ranges:

declare @counter bigint
set @counter = 500000
while @counter < 2000000000 --or whatever your max id
begin
 update test_table set bigid = id where id between (@counter - 500000) and @counter and bigid is null
 set @counter = @counter + 500000
end
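
If the maximum id is not known up front, the loop bound can be taken from the table itself instead of hard-coding it (a small variation, not part of the original answer):

declare @maxid bigint
select @maxid = max(id) from test_table
-- then loop while @counter <= @maxid instead of the hard-coded 2000000000
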
牵你手 2024-10-01 16:34:17

I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.

set rowcount 500000
update tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
from test_table tt1
where tt1.bigid IS NULL

You can also try changing the recovery model so the transaction log space can be reused:

ALTER DATABASE db1
SET RECOVERY SIMPLE
GO

update test_table
set bigid = id
GO

ALTER DATABASE db1
SET RECOVERY FULL
GO
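
If you go this route, it is worth checking the current recovery model first so you know what to switch back to afterwards (db1 is the placeholder database name used in the answer):

SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'db1'
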
简美 2024-10-01 16:34:17

The first step, if there are any indexes, would be to drop them before the operation. They are probably what is causing the slowdown over time.

The other option, a little outside-the-box thinking: can you express the update in such a way that you can materialize the column values in a SELECT? If you can, then you can create what amounts to a NEW table using SELECT INTO, which is a minimally logged operation (assuming, in 2005, that you are set to the SIMPLE or BULK_LOGGED recovery model). This would be pretty fast, and then you can drop the old table, rename the new table to the old table name, and recreate any indexes.

select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
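
Since SELECT ... INTO copies only the data, any keys and indexes have to be recreated on the renamed table; for example (the constraint name below is an assumption):

alter table test_table add constraint PK_test_table primary key clustered (id)
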
酸甜透明夹心 2024-10-01 16:34:17

I second the UPDATE TOP (X) statement.

I'd also suggest that if you're in a loop, you add a WAITFOR delay or a COMMIT in between, to allow other processes some time to use the table if needed, rather than blocking until all the updates are completed.
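
A rough sketch of what that could look like, combining the UPDATE TOP approach with a pause between batches (the batch size and delay are arbitrary choices, not from the original answer):

WHILE 1 = 1
BEGIN
    UPDATE TOP (500000) dbo.test_table
    SET bigid = id
    WHERE bigid IS NULL

    IF @@ROWCOUNT = 0 BREAK

    WAITFOR DELAY '00:00:05'   -- give other sessions a chance to use the table
END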
