识别重复记录组 (TSQL)

发布于 2025-01-11 13:14:48 字数 5542 浏览 0 评论 0原文

我在 MSSQL 数据库中有大量表(超过 95 000 000 条记录)

idConfiguration_idEquipment_group_idnamePrice
11100item110
21100item220
31100item330
42100item110
52100item220
62100item330
73100item110
83100item220
93100item331

我将识别重复的记录组。

配置 1 组

idConfiguration_idEquipment_group_idnamePrice
11100item110
21100item220
31100item330

配置 2 组

idconfiguration_idEquipment_group_idnamePrice
42100item110
52100项目2 20
62100item330

配置 3 组

id配置_idEquipment_group_idname价格
73100item110
83100item220
93100item331
  • 在我的逻辑中 组 1组2 是重复项
    • 具有相同数量的记录
    • equipment_group_idnameprice字段中具有相同的内容
  • 在字段 equipment_group_idnameprice Group 1>组 3 不重复,因为至少有一个不同的元素(最后一条记录的价格为 31,而不是 30)

如何构建查询来查找表中重复的所有组(不是记录) ?

I have got massive table (over 95 000 000 records) in MSSQL database

idconfiguration_idequipment_group_idnameprice
11100item110
21100item220
31100item330
42100item110
52100item220
62100item330
73100item110
83100item220
93100item331

I am going to identify duplicated group of records.

Configuration 1 Group

idconfiguration_idequipment_group_idnameprice
11100item110
21100item220
31100item330

Configuration 2 Group

idconfiguration_idequipment_group_idnameprice
42100item110
52100item220
62100item330

Configuration 3 Group

idconfiguration_idequipment_group_idnameprice
73100item110
83100item220
93100item331
  • in my logic Group 1 and Group 2 are duplicates
    • has the same number of records
    • has the same content in fields equipment_group_id, name, price
  • Group 1 and Group 3 are NOT duplicates because there is at least one different element (last record has price 31, not 30)

How to construct a query to find all groups that are duplicated (not records) across the table?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

梦初启 2025-01-18 13:14:49

对 95M 记录的查询性能可能并不理想,但这应该可以解决问题。

查找包含多行的组的精确匹配

DROP TABLE IF EXISTS #Config

CREATE TABLE #Config 
(id int
,configuration_id int
,equipment_group_id int
,name VARCHAR(100)
,price INT
)

INSERT INTO #Config
VALUES
(1,1,100,'item1',10)
,(2,1,100,'item2',20)
,(3,1,100,'item3',30)
,(4,2,100,'item1',10)
,(5,2,100,'item2',20)
,(6,2,100,'item3',30)
,(7,3,100,'item1',10)
,(8,3,100,'item2',20)
,(9,3,100,'item3',31)


;WITH cte_ConfigCount AS (
    SELECT *,ConfigTotalRowCnt = COUNT(*) OVER (PARTITION BY A.configuration_id) /*Counts how many rows in each config*/
    FROM #Config AS A
)

SELECT 
    A.configuration_id
    ,B.configuration_id
    ,TextDescription = CONCAT('Config #',A.configuration_id,' matches Config #',B.configuration_id)
    ,A.ConfigTotalRowCnt
    ,RowsMatch = COUNT(*)
FROM cte_ConfigCount AS A
INNER JOIN cte_ConfigCount AS B
    ON A.configuration_id < B.configuration_id /*Don't join to self and only join 1 way (so don't have one row with A-B and another row with B-A)*/
    AND a.equipment_group_id = B.equipment_group_id
    AND A.name = B.name
    AND A.price = B.price
GROUP BY A.configuration_id,A.ConfigTotalRowCnt,B.configuration_id
HAVING A.ConfigTotalRowCnt = COUNT(*) /*Only return where the total row for the config matches the rows where the configs match*/

Performance of this query for 95M records will probably not be ideal, but this should do the trick.

Find Exact Matches of Groups Containing Multiple Rows

DROP TABLE IF EXISTS #Config

CREATE TABLE #Config 
(id int
,configuration_id int
,equipment_group_id int
,name VARCHAR(100)
,price INT
)

INSERT INTO #Config
VALUES
(1,1,100,'item1',10)
,(2,1,100,'item2',20)
,(3,1,100,'item3',30)
,(4,2,100,'item1',10)
,(5,2,100,'item2',20)
,(6,2,100,'item3',30)
,(7,3,100,'item1',10)
,(8,3,100,'item2',20)
,(9,3,100,'item3',31)


;WITH cte_ConfigCount AS (
    SELECT *,ConfigTotalRowCnt = COUNT(*) OVER (PARTITION BY A.configuration_id) /*Counts how many rows in each config*/
    FROM #Config AS A
)

SELECT 
    A.configuration_id
    ,B.configuration_id
    ,TextDescription = CONCAT('Config #',A.configuration_id,' matches Config #',B.configuration_id)
    ,A.ConfigTotalRowCnt
    ,RowsMatch = COUNT(*)
FROM cte_ConfigCount AS A
INNER JOIN cte_ConfigCount AS B
    ON A.configuration_id < B.configuration_id /*Don't join to self and only join 1 way (so don't have one row with A-B and another row with B-A)*/
    AND a.equipment_group_id = B.equipment_group_id
    AND A.name = B.name
    AND A.price = B.price
GROUP BY A.configuration_id,A.ConfigTotalRowCnt,B.configuration_id
HAVING A.ConfigTotalRowCnt = COUNT(*) /*Only return where the total row for the config matches the rows where the configs match*/
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文