The fastest technique for deleting duplicate data

Published 2024-09-14 22:52:51

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.

In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:

-- Step 1:  **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
FROM sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1

Next,

-- Step 2: **This step is taking the 3+ hours**
-- delete every row whose nine non-unique columns match a duplicate group
-- and whose prikey is not that group's max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey

Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.

Thanks so much.

UPDATE:
I had to stop step 2 from running at the 9-hour mark.
I tried OMG Ponies' method and it finished in only 40 minutes.
I tried my step 2 with Andomar's batch delete; it had run for 9 hours before I stopped it.
UPDATE:
Ran a similar query with one less field to get rid of a different set of duplicates, and the query ran for only 4 minutes (8,000 rows) using OMG Ponies' method.

I will try the cte technique the next chance I get, however, I suspect OMG Ponies' method will be tough to beat.

7 Answers

简单爱 2024-09-21 22:52:51

What about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)
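
The T-SQL above targets SQL Server. As a hypothetical small-scale sketch of the same keep-the-max-key logic, here is a self-contained version using Python's sqlite3 module; the table, columns, and data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sourceTable (PriKey INTEGER PRIMARY KEY, a TEXT, b TEXT)")
# Three exact duplicates of (x, y) plus one unique row.
cur.executemany("INSERT INTO sourceTable (a, b) VALUES (?, ?)",
                [("x", "y"), ("x", "y"), ("x", "y"), ("u", "v")])

# Step 1: one row per duplicate group, keeping the max key.
cur.execute("""CREATE TEMP TABLE dupTemp AS
               SELECT MAX(PriKey) AS MaxPriKey, a, b
               FROM sourceTable GROUP BY a, b HAVING COUNT(*) > 1""")

# Step 2: EXISTS-based delete of every duplicate except the kept one.
cur.execute("""DELETE FROM sourceTable
               WHERE EXISTS (SELECT NULL FROM dupTemp dt
                             WHERE sourceTable.a = dt.a
                               AND sourceTable.b = dt.b
                               AND sourceTable.PriKey < dt.MaxPriKey)""")

remaining = cur.execute("SELECT COUNT(*) FROM sourceTable").fetchone()[0]
print(remaining)  # 2: one survivor of the duplicate group plus the unique row
```

One reason the EXISTS form can be faster is that it is a semi-join: the engine can stop probing the lookup table at the first match for each outer row.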
肤浅与狂妄 2024-09-21 22:52:51

Can you afford to have the original table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.

Then drop the original table and rename the intermediate table to have the same name as the old table.
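
A minimal sketch of this create-then-swap approach, using Python's sqlite3 (names are illustrative; a real SQL Server version would use SELECT ... INTO plus a rename, and would need to recreate indexes and constraints):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sourceTable (PriKey INTEGER PRIMARY KEY, a TEXT, b TEXT)")
cur.executemany("INSERT INTO sourceTable (a, b) VALUES (?, ?)",
                [("x", "y")] * 3 + [("u", "v")])

# Build a deduplicated replacement table (one row per group),
# then swap it in under the original name.
cur.execute("""CREATE TABLE sourceTable_clean AS
               SELECT MAX(PriKey) AS PriKey, a, b
               FROM sourceTable GROUP BY a, b""")
cur.execute("DROP TABLE sourceTable")
cur.execute("ALTER TABLE sourceTable_clean RENAME TO sourceTable")

remaining = cur.execute("SELECT COUNT(*) FROM sourceTable").fetchone()[0]
print(remaining)  # 2
```

Note that a table built this way loses the original's constraints and indexes; they have to be recreated before the swapped-in table goes live.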

Hello爱情风 2024-09-21 22:52:51

The bottleneck in bulk row deletion is usually the transaction that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:

while 1 = 1
begin

    delete top (100)
    from sourceTable
    ...

    if @@rowcount = 0
        break
end
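
To make the batching idea concrete (the `...` above stands for whatever join or filter identifies the duplicates), here is a hypothetical sqlite3 sketch that deletes the non-max rows of each group 100 at a time:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sourceTable (PriKey INTEGER PRIMARY KEY, a TEXT)")
# 950 copies of one value plus a single unique row.
cur.executemany("INSERT INTO sourceTable (a) VALUES (?)",
                [("dup",)] * 950 + [("keep",)])

batches = 0
while True:
    # Each pass deletes at most 100 rows that are not their group's max key,
    # keeping every individual transaction small.
    cur.execute("""DELETE FROM sourceTable WHERE PriKey IN (
                       SELECT PriKey FROM sourceTable
                       WHERE PriKey NOT IN (SELECT MAX(PriKey)
                                            FROM sourceTable GROUP BY a)
                       LIMIT 100)""")
    con.commit()
    if cur.rowcount == 0:
        break
    batches += 1

remaining = cur.execute("SELECT COUNT(*) FROM sourceTable").fetchone()[0]
print(batches, remaining)  # 10 batches; 2 rows left
```

Committing after each chunk is what keeps the log/transaction small; a single giant delete forces the engine to hold the whole operation open at once.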
甜`诱少女 2024-09-21 22:52:51

...based on OMG Ponies' comment above, a CTE method that's a little more compact. This method works wonders on tables where (for whatever reason) you have no primary key, so rows can be identical across all columns.

;WITH cte AS (
 SELECT ROW_NUMBER() OVER 
          (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
    FROM sourceTable
)
DELETE
FROM cte
WHERE sequence > 1
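
For comparison, SQLite cannot DELETE through a CTE, but the same row-numbering trick works against the implicit rowid. A hypothetical sketch with Python's sqlite3 (window functions need SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sourceTable (a TEXT, b TEXT)")  # deliberately no primary key
cur.executemany("INSERT INTO sourceTable VALUES (?, ?)",
                [("x", "y")] * 3 + [("u", "v")] * 2)

# Number the rows within each duplicate group, then delete everything past
# the first row of each group, addressing rows by their hidden rowid.
cur.execute("""DELETE FROM sourceTable WHERE rowid IN (
                   SELECT rid FROM (
                       SELECT rowid AS rid,
                              ROW_NUMBER() OVER (PARTITION BY a, b
                                                 ORDER BY rowid DESC) AS seq
                       FROM sourceTable)
                   WHERE seq > 1)""")

remaining = cur.execute("SELECT COUNT(*) FROM sourceTable").fetchone()[0]
print(remaining)  # 2: one row per duplicate group
```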
趴在窗边数星星i 2024-09-21 22:52:51

Well, lots of different things. First, would something like this work (do a select to make sure; maybe even put the results into a temp table of its own, #recordsToDelete):

delete sourceTable
from sourceTable
left join #dupTemp on
    sourceTable.PriKey = #dupTemp.MaxPriKey
where #dupTemp.MaxPriKey is null

Next, you can index temp tables; put an index on prikey.

If you have the records you want to delete in a temp table, you can delete in batches, which is often faster than locking up the whole table with one delete.

喜你已久 2024-09-21 22:52:51

Here's a version where you can combine both steps into a single step.

WITH cte AS
    ( SELECT prikey, ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY
        prikey DESC) AS sequence
    FROM sourceTable
    )

DELETE
FROM sourceTable
WHERE prikey IN
    ( SELECT prikey
    FROM cte
    WHERE sequence > 1
    ) ;

By the way, do you have any indexes that can be temporarily removed?

紫竹語嫣☆ 2024-09-21 22:52:51

If you're using an Oracle database, I recently found that the following statement performs best, in terms of both total duration and CPU consumption.
I've performed several tests with different data sizes, from tens of rows to thousands, always in a loop. I used the TKProf tool to analyze the results.

Compared to the ROW_NUMBER() solution above, this approach took 2/3 of the original time and consumed about 50% of the CPU time. It seemed to behave linearly, i.e. it should give similar results for any input data size.

Feel free to give me your feedback. I wonder if there is a better method.

DELETE FROM sourceTable
WHERE
    ROWID IN(
        -- delete all
        SELECT ROWID
        FROM sourceTable t
        MINUS
        -- but keep every unique row
        SELECT
            rid
        FROM
            (
            SELECT a,b,c,d,e,f,g,h,i, MAX(ROWID) KEEP (DENSE_RANK FIRST ORDER BY ROWID) AS RID
            FROM sourceTable t
            GROUP BY a,b,c,d,e,f,g,h,i
            )
    )
;
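
As a rough cross-check of the same shape on another engine, here is a hypothetical sqlite3 sketch: EXCEPT plays the role of Oracle's MINUS, rowid stands in for ROWID, and MAX(rowid) picks the one row kept per group (for exact duplicates it doesn't matter which copy survives):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sourceTable (a TEXT, b TEXT)")
cur.executemany("INSERT INTO sourceTable VALUES (?, ?)",
                [("x", "y")] * 3 + [("u", "v")])

# Delete every rowid except the one kept per duplicate group.
cur.execute("""DELETE FROM sourceTable WHERE rowid IN (
                   SELECT rowid FROM sourceTable
                   EXCEPT
                   SELECT MAX(rowid) FROM sourceTable GROUP BY a, b)""")

remaining = cur.execute("SELECT COUNT(*) FROM sourceTable").fetchone()[0]
print(remaining)  # 2
```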