SQL Server 2008 重复数据删除

发布于 2024-10-08 09:25:36 字数 427 浏览 1 评论 0原文

长话短说,我接手了一个项目,数据库中的一个表非常需要重复数据删除。该表如下所示:

supply_req_id | int      | [primary key]
supply_req_dt | datetime | 
request_id    | int      | [foreign key]
supply_id     | int      | [foreign key]
is_disabled   | bit      |

存在具有相同 request_id 和 Supply_id 的记录的重复项。我想找到一种最佳实践方法来消除该表的重复数据。

[编辑]
@Kirk_Broadhurst,谢谢你的提问。由于 Supply_req_id 没有在其他地方引用,我会回答说保留第一个,删除任何后续出现的情况。

节日快乐

Long story short, I took over a project and a table in the database is in serious need of de-duping. The table looks like this:

supply_req_id | int      | [primary key]
supply_req_dt | datetime | 
request_id    | int      | [foreign key]
supply_id     | int      | [foreign key]
is_disabled   | bit      |

The duplication is exists with records having the same request_id and supply_id. I'd like to find a best practice way to de-dupe this table.

[EDIT]
@Kirk_Broadhurst, thanks for the question. Since supply_req_id is not referenced anywhere else, I would answer by saying keep the first, delete any subsequent occurances.

Happy Holidays

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

话少情深 2024-10-15 09:25:36

这会为 (supply_req_dt, request_id) 分组中的每一行创建一个排名,从 1 = 最低的 Supply_req_id 开始。任何骗局都有一个价值> 1

;WITH cDupes AS
(
    SELECT
     supply_req_id,
     ROW_NUMBER() OVER (PARTITION BY supply_req_dt, request_id ORDER BY supply_req_id) AS RowNum
    FROM
     MyTable
)
DELETE
   cDupes
WHERE
   RowNum > 1

然后添加唯一约束或INDEX

CREATE UNIQUE INDEX IXU_NoDupes ON MyTable (supply_req_dt, request_id)

This creates a rank for each row in the (supply_req_dt, request_id) grouping, starting with 1 = lowest supply_req_id. Any dupe has a value > 1

;WITH cDupes AS
(
    SELECT
     supply_req_id,
     ROW_NUMBER() OVER (PARTITION BY supply_req_dt, request_id ORDER BY supply_req_id) AS RowNum
    FROM
     MyTable
)
DELETE
   cDupes
WHERE
   RowNum > 1

Then add a unique constraint or INDEX

CREATE UNIQUE INDEX IXU_NoDupes ON MyTable (supply_req_dt, request_id)
み零 2024-10-15 09:25:36

似乎应该有一个命令,但也许那是因为我习惯了不同的数据库服务器。这是相关的支持文档:

如何从 SQL Server 中的表中删除重复的行
http://support.microsoft.com/kb/139444

Seems like there should be a command for this, but maybe that's because I'm used to a different database server. Here's the relevant support doc:

How to remove duplicate rows from a table in SQL Server
http://support.microsoft.com/kb/139444

妞丶爷亲个 2024-10-15 09:25:36

您需要阐明确定在“匹配”情况下保留哪条记录的规则 - 最新的、最早的、is_disabled true 或 false 的记录?

一旦确定了该规则,剩下的事情就相当简单了:

  1. 选择要保留的记录 - 将不同记录
  2. 连接回原始表以获取这些记录的 ID。
  3. 删除不在连接数据集中的所有内容。

假设您想要保留任何“重复”对的最新记录。您的查询将如下所示:

DELETE FROM [table] WHERE supply_req_id NOT IN
(SELECT supply_req_id from [table] t 
INNER JOIN
    (SELECT MAX(supply_req_dt) dt, request_id, supply_id 
    FROM [table] 
    GROUP BY request_id, supply_id) d
ON t.supply_req_dt = d.dt
AND t.request_id = d.request_id 
AND t.supply_id = d.supply_id)

问题是,如果 supply_req_dt 也重复,那么您将保留两个重复项。修复方法是执行另一个 group by 并选择顶部的 id

select MAX(supply_req_id), supply_req_dt, request_id, supply_id 
group by supply_req_dt, request_id, supply_id 

作为临时步骤。但如果您不需要这样做,请不要为此烦恼。

You need to clarify your rule for determining which record to keep in the case of a 'match' - the most recent, the earliest, the one that has is_disabled true, or false?

Once you've identified that rule, the rest is fairly simple:

  1. select the records you want to keep - the distinct records
  2. join back to the original table to get the ids for those records.
  3. delete everthing where not in the joined dataset.

So let's say you want to keep the most recent records of any 'duplicate' pair. Your query would look like this:

DELETE FROM [table] WHERE supply_req_id NOT IN
(SELECT supply_req_id from [table] t 
INNER JOIN
    (SELECT MAX(supply_req_dt) dt, request_id, supply_id 
    FROM [table] 
    GROUP BY request_id, supply_id) d
ON t.supply_req_dt = d.dt
AND t.request_id = d.request_id 
AND t.supply_id = d.supply_id)

The catch is that if the supply_req_dt is also duplicated, then you'll be keeping both of the duplicates. The fix is to do another group by and select the top id

select MAX(supply_req_id), supply_req_dt, request_id, supply_id 
group by supply_req_dt, request_id, supply_id 

as an interim step. But if you don't need to do that, don't bother with it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文