SQL duplicate-delete query over millions of rows, for performance

Published 2024-07-06 09:20:11


This has been an adventure. I started with the looping duplicate query from my previous question, but each loop would go over all 17 million records, meaning it would take weeks (just running `select count(*) from MyTable` takes my server 4 minutes 30 seconds on MSSQL 2005). I gleaned information from this site and from this post.

And I have arrived at the query below. The question is: is this the right kind of query to run against 17 million records, for any kind of performance? If it isn't, what is?

SQL QUERY:

DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID IN
    (SELECT RecordID
     FROM tl_acxiomimport.dbo.tblacxiomlistings
     EXCEPT
     SELECT RecordID
     FROM (
         SELECT RecordID,
                Rank() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                             ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
         FROM tl_acxiomimport.dbo.tblacxiomlistings
     ) al
     WHERE Rank = 1)


Comments (11)

深白境迁sunset 2024-07-13 09:20:11


Seeing the QueryPlan would help.

Is this feasible?

SELECT m.*
INTO #temp
FROM tl_acxiomimport.dbo.tblacxiomlistings m
INNER JOIN (SELECT RecordID,
                   Rank() OVER (PARTITION BY BusinessName,
                                             latitude,
                                             longitude,
                                             Phone
                                ORDER BY webaddress DESC,
                                         caption1 DESC,
                                         caption2 DESC) AS Rank
            FROM tl_acxiomimport.dbo.tblacxiomlistings
           ) al ON (al.RecordID = m.RecordID AND al.Rank = 1)

TRUNCATE TABLE tl_acxiomimport.dbo.tblacxiomlistings

INSERT INTO tl_acxiomimport.dbo.tblacxiomlistings
    SELECT * FROM #temp

柠檬色的秋千 2024-07-13 09:20:11


Something's up with your DB, server, storage, or some combination thereof. 4:30 for a SELECT COUNT(*) seems VERY high.

Run DBCC SHOWCONTIG to see how fragmented your table is; fragmentation could cause a major performance hit on a table that size.

Also, to add to RyanKeeter's comment, run the showplan and, if there are any table scans, create an index on the PK field of that table.
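A minimal sketch of the two checks suggested above. The index is only a guess: RecordID is named in the question but never defined, so it is assumed here to be the intended unique key.

```sql
-- Check fragmentation of the listings table (SQL Server 2005 syntax)
USE tl_acxiomimport;
DBCC SHOWCONTIG ('dbo.tblacxiomlistings');

-- If the showplan reveals table scans, a clustered index on the key
-- column usually helps; RecordID is assumed to be unique here.
CREATE UNIQUE CLUSTERED INDEX IX_tblacxiomlistings_RecordID
    ON dbo.tblacxiomlistings (RecordID);
```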

请止步禁区 2024-07-13 09:20:11


Wouldn't it be simpler to do the following? (Note the derived table needs an alias, `ranked` here, or SQL Server will reject the statement.)

DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID IN
(SELECT RecordID
   FROM (
        SELECT RecordID,
            Rank() OVER (PARTITION BY BusinessName,
                                      latitude,
                                      longitude,
                                      Phone
                         ORDER BY webaddress DESC,
                                  caption1 DESC,
                                  caption2 DESC) AS Rank
        FROM tl_acxiomimport.dbo.tblacxiomlistings
        ) ranked
  WHERE Rank > 1
  )
梓梦 2024-07-13 09:20:11


Run this in Query Analyzer:

SET SHOWPLAN_TEXT ON

Then ask Query Analyzer to run your query. Instead of running the query, SQL Server will generate a query plan and put it in the result set.

Show us the query plan.

自找没趣 2024-07-13 09:20:11


17 million records is nothing. If it takes 4:30 just to do a SELECT COUNT(*), then there is a serious problem, probably related either to a lack of memory in the server or to a really old processor.

For performance, fix the machine. Pump it up to 2 GB. RAM is so cheap these days that its cost is far less than your time.

Is the processor or disk thrashing while that query runs? If not, then something is blocking the calls. In that case you might consider putting the database in single-user mode for as long as it takes to run the cleanup.

爱本泡沫多脆弱 2024-07-13 09:20:11


So you're deleting all the records that aren't ranked first? It might be worth comparing the join against a TOP 1 subquery (which might also work on SQL 2000, since RANK() is 2005 and above only).

Do you need to remove all the duplicates in a single operation? I assume you're performing some sort of housekeeping task; you might be able to do it piecewise.

Basically, create a cursor that loops over all the records (dirty read) and removes the dupes for each one. It'll be a lot slower overall, but each operation will be relatively minimal. Then your housekeeping becomes a constant background task rather than a nightly batch.
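For the SQL 2000-compatible comparison suggested above, a sketch using a correlated TOP 1 subquery instead of RANK(), with the same keep-the-best ordering as the question's query. The trailing RecordID tie-breaker is my addition, to make the kept row deterministic when all the ordering columns tie.

```sql
-- Delete every row whose RecordID is not the "best" one in its
-- duplicate group; the correlated TOP 1 picks the keeper per group.
DELETE m
FROM tl_acxiomimport.dbo.tblacxiomlistings m
WHERE m.RecordID <> (
    SELECT TOP 1 d.RecordID
    FROM tl_acxiomimport.dbo.tblacxiomlistings d
    WHERE d.BusinessName = m.BusinessName
      AND d.latitude  = m.latitude
      AND d.longitude = m.longitude
      AND d.Phone     = m.Phone
    ORDER BY d.webaddress DESC, d.caption1 DESC, d.caption2 DESC,
             d.RecordID);
```

Without a supporting index on (BusinessName, latitude, longitude, Phone), the correlated subquery will rescan the table per row, so this is only worth trying alongside such an index.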

甜柠檬 2024-07-13 09:20:11


The suggestion above to select into a temporary table first is your best bet. You could also use something like:

set rowcount 1000

before running your delete. The delete will stop after removing 1000 rows; then run it again and again until 0 records are deleted.

淡淡の花香 2024-07-13 09:20:11


If I get it correctly, your query is the same as the one below. (Two fixes are needed to make it valid T-SQL: the DELETE must target the alias `allRecords`, and `Rank` cannot be filtered in the same query level that defines it, so it is wrapped in a derived table.)

DELETE allRecords
FROM
    tl_acxiomimport.dbo.tblacxiomlistings allRecords
    LEFT JOIN (
        SELECT RecordID
        FROM (
            SELECT RecordID,
                   Rank() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                                ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
            FROM tl_acxiomimport.dbo.tblacxiomlistings
        ) ranked
        WHERE Rank = 1) myExceptions
    ON allRecords.RecordID = myExceptions.RecordID
WHERE
    myExceptions.RecordID IS NULL

I think that should run faster; I tend to avoid the "IN" clause in favor of JOINs where possible.

You can actually test the speed and the results safely by simply running SELECT * or SELECT COUNT(*) over the same FROM part, e.g.

SELECT *
FROM
    tl_acxiomimport.dbo.tblacxiomlistings allRecords
    LEFT JOIN (
        SELECT RecordID
        FROM (
            SELECT RecordID,
                   Rank() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                                ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
            FROM tl_acxiomimport.dbo.tblacxiomlistings
        ) ranked
        WHERE Rank = 1) myExceptions
    ON allRecords.RecordID = myExceptions.RecordID
WHERE
    myExceptions.RecordID IS NULL

That is another reason why I prefer the JOIN approach.
I hope that helps.

秋日私语 2024-07-13 09:20:11


This looks fine but you might consider selecting your data into a temporary table and using that in your delete statement. I've noticed huge performance gains from doing this instead of doing it all in that one query.

橘虞初梦 2024-07-13 09:20:11


Remember, when doing a large delete it is best to have a good backup first. (And I also usually copy the deleted records to another table, just in case I need to recover them right away.)
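On SQL Server 2005 the "copy before delete" idea can be sketched with the OUTPUT clause, which captures the deleted rows in the same statement. The archive table `tblacxiomlistings_deleted` is hypothetical; it is assumed to already exist with the same columns as the source table.

```sql
-- Archive rows as they are deleted (SQL Server 2005+),
-- using the same duplicate criteria as the question's query.
DELETE tl_acxiomimport.dbo.tblacxiomlistings
OUTPUT DELETED.* INTO tl_acxiomimport.dbo.tblacxiomlistings_deleted
WHERE RecordID IN
    (SELECT RecordID
     FROM (
         SELECT RecordID,
                Rank() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                             ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
         FROM tl_acxiomimport.dbo.tblacxiomlistings
     ) ranked
     WHERE Rank > 1);
```

This keeps the archive and the delete in one transaction, so the copy cannot drift out of sync with what was actually removed.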

我很OK 2024-07-13 09:20:11


Other than using TRUNCATE as suggested, I've had the best luck using this template for deleting lots of rows from a table. I don't remember offhand, but I think using the transaction helped keep the log file from growing; it may have been for another reason though, not sure. And I usually switch the transaction logging method over to simple before doing something like this:

SET ROWCOUNT 5000
WHILE 1 = 1
BEGIN
    BEGIN TRAN
    DELETE FROM ??? WHERE ???
    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT
        BREAK
    END
    COMMIT
END
SET ROWCOUNT 0