SQL batch delete

I have a table in SQL Server 2005 which has approx 4 billion rows in it. I need to delete approximately 2 billion of these rows. If I try and do it in a single transaction, the transaction log fills up and it fails. I don't have any extra space to make the transaction log bigger. I assume the best way forward is to batch up the delete statements (in batches of ~ 10,000?).

I can probably do this using a cursor, but is there a standard/easy/clever way of doing this?

P.S. This table does not have an identity column as a PK. The PK is made up of an integer foreign key and a date.

Comments (9)

回忆那么伤 2024-07-28 13:49:03

I agree with the people who want you to loop over a smaller set of records; this will be faster than trying to do the whole operation in one step. You will need to experiment with the number of records to include in each loop. About 2,000 at a time seems to be the sweet spot in most of the tables I do large deletes from, although a few need smaller amounts like 500. It depends on the number of foreign keys, the size of the record, triggers, etc., so it really will take some experimenting to find what you need. It also depends on how heavy the use of the table is. A heavily accessed table will need each iteration of the loop to run for a shorter amount of time. If you can run during off hours, or better yet in single-user mode, then you can delete more records in one loop.

If you don't think you can do this in one night during off hours, it might be best to design the loop with a counter and only do a set number of iterations each night until it is done.

Further, if you use an implicit transaction rather than an explicit one, you can kill the loop query at any time and the records already deleted will stay deleted, except those in the current round of the loop. Much faster than trying to roll back half a million records because you've brought the system to a halt.

It is usually a good idea to back up the database immediately before undertaking an operation of this nature.
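
A minimal sketch of the counter-based nightly loop described above, under assumed names (your_table and <your_condition> are placeholders; the batch size and per-night iteration cap need the kind of experimentation mentioned):

DECLARE @batch_size INT
DECLARE @max_iterations INT
DECLARE @i INT
DECLARE @rows INT

SET @batch_size = 2000      -- the suggested sweet spot; tune per table
SET @max_iterations = 500   -- caps the work done in one night's run
SET @i = 0
SET @rows = 1

WHILE @i < @max_iterations AND @rows > 0
BEGIN
    -- Each DELETE commits on its own (autocommit), so killing the
    -- loop only loses the current batch.
    DELETE TOP (@batch_size) FROM your_table
    WHERE <your_condition>

    SET @rows = @@ROWCOUNT   -- 0 means nothing left to delete
    SET @i = @i + 1
END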

咋地 2024-07-28 13:49:02

You can 'nibble' the deletes, which also means that you don't cause a massive load on the database. If your t-log backups run every 10 mins, then you should be OK to run this once or twice over the same interval. You can schedule it as a SQL Agent job.

try something like this:

DECLARE @count int
SET @count = 10000

DELETE FROM table1
WHERE table1id IN (
    SELECT TOP (@count) table1id
    FROM table1
    WHERE x = 'y'
)
水水月牙 2024-07-28 13:49:02

What distinguishes the rows you want to delete from those you want to keep? Will this work for you:

while exists (select 1 from your_table where <your_condition>)
delete top(10000) from your_table
where <your_condition>
相权↑美人 2024-07-28 13:49:02

Here is my example:

-- configure script
-- Script limits - transactions per commit (default 10,000)
-- and time to allow the script to run (in seconds, default 2 hours)
--
DECLARE @MAX INT
DECLARE @MAXT INT
--
-- These values are substituted by a shell script before execution;
-- the $TABLE and $WHERE placeholders below are likewise pasted in
-- as literal SQL text (T-SQL cannot use variables as table names).
--
SET @MAX = $MAX
SET @MAXT = $MAXT

-- step 1 - Main loop
DECLARE @continue INT
-- deleted in one transaction
DECLARE @deleted INT
-- deleted total in script
DECLARE @total INT
SET @total = 0
DECLARE @max_id INT, @start_id INT, @end_id INT, @interval INT
SET @interval = @MAX
SELECT @start_id = MIN(id), @max_id = MAX(id) FROM $TABLE

-- timing
DECLARE @start DATETIME
DECLARE @now DATETIME
DECLARE @timee INT
SET @start = GETDATE()
--
SET @continue = 1
-- EntryID persists the next starting id between runs, so the script
-- can resume where it left off.
IF OBJECT_ID (N'EntryID', 'U') IS NULL
BEGIN
    CREATE TABLE EntryID (startid INT)
    INSERT INTO EntryID(startid) VALUES(@start_id)
END
ELSE
BEGIN
    SELECT @start_id = startid FROM EntryID
END
-- compute the first range only after a possible resume
SET @end_id = @start_id + @interval

WHILE (@continue = 1 AND @start_id <= @max_id)
BEGIN

    PRINT 'Start issued:   ' + CONVERT(varchar(19), GETDATE(), 120)
    BEGIN TRANSACTION
        DELETE
        FROM $TABLE
        WHERE id BETWEEN @start_id AND @end_id AND $WHERE
        SET @deleted = @@ROWCOUNT
        UPDATE EntryID SET EntryID.startid = @end_id + 1
    COMMIT
    PRINT 'Deleted issued: ' + STR(@deleted) + ' records. ' + CONVERT(varchar(19), GETDATE(), 120)
    SET @total = @total + @deleted
    SET @start_id = @end_id + 1
    SET @end_id = @end_id + @interval
    IF @end_id > @max_id
        SET @end_id = @max_id

    SET @now = GETDATE()
    SET @timee = DATEDIFF(second, @start, @now)
    IF @timee > @MAXT
    BEGIN
        PRINT 'Time limit exceeded for the script, exiting'
        SET @continue = 0
    END
--    ELSE
--    BEGIN
--      SELECT @total 'Removed now', @timee 'Total time, seconds'
--    END
END

SELECT @total 'Removed records', @timee 'Total time sec', @start_id 'Next id', @max_id 'Max id', @continue 'COMPLETED?'
SELECT * FROM EntryID next_start_id

GO
Hello爱情风 2024-07-28 13:49:02

Sounds like this is a one-off operation (I hope so, for your sake) and you don't need to go back to a state that's halfway through this batched delete - if that's the case, why don't you just switch to the SIMPLE recovery model before running and then back to FULL when you're done?

This way the transaction log won't grow as much. This might not be ideal in most situations, but I don't see anything wrong here (assuming, as above, that you don't need to go back to a state that's in between your deletes).

You can do this in your script with something like:

ALTER DATABASE myDB SET RECOVERY SIMPLE
-- ... run the batched delete ...
ALTER DATABASE myDB SET RECOVERY FULL

Alternatively you can set up a job to shrink the transaction log at a given interval - while your delete is running. This is kinda bad, but I reckon it'd do the trick.
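
One caveat this answer doesn't mention: switching out of the FULL recovery model breaks the log backup chain. A sketch of how the sequence would end (the database name and backup path are placeholders):

ALTER DATABASE myDB SET RECOVERY FULL
GO
-- Log backups taken after a switch to SIMPLE are not usable until a
-- new full (or differential) backup re-establishes the chain.
BACKUP DATABASE myDB TO DISK = N'D:\backup\myDB_full.bak'
GO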

听闻余生 2024-07-28 13:49:02

Well, if you were using SQL Server Partitioning, say based on the date column, you would have possibly switched out the partitions that are no longer required. A consideration for a future implementation perhaps.

I think the best option may be as you say, to delete the data in smaller batches, rather than in one hit, so as to avoid any potential blocking issues.

You could also consider the following method:

  1. Copy the data to keep into a temporary table
  2. Truncate the original table to purge all data
  3. Move everything from the temporary table back into the original table

Your indexes would also be rebuilt as the data was added back to the original table.
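
A minimal sketch of that copy/truncate/reload route under placeholder names (my_table and the keep-predicate are assumptions; note the copy consumes tempdb space and the reload is itself logged):

-- 1. Copy the rows to keep into a temp table (uses tempdb space)
SELECT *
INTO #keep
FROM my_table
WHERE <rows_to_keep_condition>

-- 2. Truncate the original table - minimally logged and fast
TRUNCATE TABLE my_table

-- 3. Move everything back into the original table
INSERT INTO my_table
SELECT * FROM #keep

DROP TABLE #keep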

未央 2024-07-28 13:49:02

I would do something similar to the temp table suggestion, but I'd select the rows you want to keep into a new permanent table, drop the original table and then rename the new one. This should have a relatively low tran log impact. Obviously, remember to recreate any indexes that are required on the new table after you've renamed it.

Just my two p'enneth.
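
A sketch of that approach under assumed names (my_table, and the PK columns fk_id/date_col, are placeholders; SELECT ... INTO is minimally logged under the SIMPLE or BULK_LOGGED recovery model):

-- Copy only the rows to keep into a brand-new permanent table
SELECT *
INTO my_table_new
FROM my_table
WHERE <rows_to_keep_condition>

DROP TABLE my_table

EXEC sp_rename 'my_table_new', 'my_table'

-- Recreate the PK and any other required indexes, e.g.:
-- ALTER TABLE my_table ADD CONSTRAINT PK_my_table PRIMARY KEY (fk_id, date_col)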

生来就爱笑 2024-07-28 13:49:02

In addition to putting this in a batch with a statement to truncate the log, you also might want to try these tricks:

  • Add criteria that matches the first column in your clustered index in addition to your other criteria
  • Drop any indexes from the table and then put them back after the delete is done if that's possible and won't interfere with anything else going on in the DB, but KEEP the clustered index

For the first point above, for example, if your PK is clustered then find a range which approximately matches the number of rows that you want to delete each batch and use that:

DECLARE @max_id INT, @start_id INT, @end_id INT, @interval INT
SELECT @start_id = MIN(id), @max_id = MAX(id) FROM My_Table
SET @interval = 100000  -- You need to determine the right number here
SET @end_id = @start_id + @interval

WHILE (@start_id <= @max_id)
BEGIN
     DELETE FROM My_Table WHERE id BETWEEN @start_id AND @end_id AND <your criteria>

     SET @start_id = @end_id + 1
     SET @end_id = @end_id + @interval
END
山田美奈子 2024-07-28 13:49:02

The short answer is, you can't delete 2 billion rows without incurring some kind of major database downtime.

Your best option may be to copy the data to a temp table and truncate the original table, but this will fill your tempdb and would use no less logging than deleting the data.

You will need to delete as many rows as you can until the transaction log fills up, then truncate it each time. The answer provided by Stanislav Kniazev could be modified to do this by increasing the batch size and adding a call to truncate the log file.
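
A hedged sketch of that modification (myDB, table1, and the predicate are placeholders; BACKUP LOG ... WITH TRUNCATE_ONLY is valid on SQL Server 2005 but was removed in later versions, and it breaks the log backup chain):

DECLARE @count int
SET @count = 1000000   -- much larger batch than the nibble approach

WHILE EXISTS (SELECT 1 FROM table1 WHERE x = 'y')
BEGIN
    DELETE TOP (@count) FROM table1
    WHERE x = 'y'

    -- Discard the inactive portion of the log so the next batch has
    -- room to write. SQL Server 2005 only; removed in 2008+.
    BACKUP LOG myDB WITH TRUNCATE_ONLY
END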
