SQL 查询消除相似条目

发布于 2024-12-01 00:15:10 字数 1644 浏览 4 评论 0原文

我正在解决 SQL Server 2008 中的一个问题，

我有一个包含六列的表：

PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFiveSmallINT
dSix SmallINT

该表包含大约一百万条记录。可能值得注意的是，第 n+1 列中的值 > n 列中的值，即 97、98、99、120、135。我试图消除所有具有 5 个共同数字的行（忽略 PK），即：

76, 89, 99, 102, 155, 122
11, 89, 99, 102, 155, 122
89, 99, 102, 155, 122, 130

在这种情况下，算法应该从第一行开始并删除第二行和第三行，因为它们包含 5 个匹配的数字。第一行仍然存在。

我试图强力解决该问题，但仅找到第一个记录的所有重复项需要花费 25 秒以上，这意味着处理整个表将花费......太长的时间（这应该是一个可重复的过程）。

我对 SQL 相当陌生，但这就是我想出的（我想出了一些解决方案，但没有一个是足够的......这是最新的尝试）：（

我不会包含所有代码，但我会解释一下方法，如果有帮助的话我可以粘贴更多）

将记录n的数字保存到变量中。从largeTable 中选择与记录n 有一位共同数字的所有记录。

将所有选定的数字插入 #oneMatch 并包含 [matchingOne] 和匹配的数字。

从临时表中选择与记录 n 有一位共同数字的所有记录，其中“共同数字”！= [匹配]。将所有选定的数字插入#twoMatch 并包括[matchingOne] 和[matchingTwo]...

重复直到插入# FiveMatch。从 largeTable 中删除 # FiveMatch 并移动到记录 n+1

我在实现此解决方案时遇到问题。如何根据 WHERE 子句分配匹配变量？

-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE  ( [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dOne?
      OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dTwo?
      OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dThree?
...
      OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dSix?
    )

我可以使用六个查询来“伪造”上述内容，但这太低效了......

抱歉，描述很长。任何帮助将不胜感激（新的解决方案或我上面的尝试的实现），因为这个问题已经困扰我一段时间了......

原文

I am working on a problem in SQL Server 2008

I have a table with six columns:

PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFiveSmallINT
dSix SmallINT

The table contains around a million recrods. It's probably worth noting that value in column n+1 > value in column n i.e. 97, 98, 99, 120, 135. I am trying to eliminate all rows which have 5 DIGITS in common (ignoring the PK) i.e.:

76, 89, 99, 102, 155, 122
11, 89, 99, 102, 155, 122
89, 99, 102, 155, 122, 130

In this case the algorithm should start at the first row and delete the second and third rows because they contain 5 matching digits. The first row persists.

I have tried to brute force the solution but finding all the duplicates for only the first record takes upwards of 25 seconds meaning processing the whole table would take... way too long (this should be a repeatable process).

I am fairly new to SQL but this is what I have come up with (I have come up with a few solutions but none were adequate... this is the latest attempt):

(I won't include all the code but I will explain the method, I can paste more if it helps)

Save the digits of record n into variables. SELECT all records which have one digit in common with record n FROM largeTable.

Insert all selected digits into #oneMatch and include [matchingOne] with the digit that matched.

Select all records which have one digit in common with record n FROM the temp table WHERE 'digit in common' != [matching]. INSERT all selected digits into #twoMatch and include [matchingOne] AND [matchingTwo]...

Repeat until inserting into #fiveMatch. Delete #fiveMatch from largeTable and move to record n+1

I am having a problem implementing this solution. How can I assign the matching variable depending on the WHERE clause?

-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE  ( [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dOne?
      OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dTwo?
      OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dThree?
...
      OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dSix?
    )

I am able to 'fake' the above using six queries but that is too inefficient...

Sorry for the long description. Any help would be greatly appreciated (new solution or implementation of my attempt above) as this problem has been nagging at me for a while...

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

清君侧 2024-12-08 00:15:10

除非我错过了什么，否则这应该会产生正确的结果。

declare @T table 
(
  PK INT identity primary key,
  dOne SmallINT,
  dTwo SmallINT,
  dThree SmallINT,
  dFour SmallINT,
  dFive SmallINT,
  dSix SmallINT
)

insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)

;with q1(PK, d1, d2, d3, d4, d5) as
(
  select PK, dTwo, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dFive
  from @T
),
q2 as
(
  select PK,
         row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
  from q1
),
q3 as
(
  select PK
  from q2
  where rn = 1
  group by PK
  having count(*) = 6   
)
select T.*
from @T as T
  inner join q3 as Q
    on T.PK = Q.PK

Unless I miss something this should produce the correct result.

declare @T table 
(
  PK INT identity primary key,
  dOne SmallINT,
  dTwo SmallINT,
  dThree SmallINT,
  dFour SmallINT,
  dFive SmallINT,
  dSix SmallINT
)

insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)

;with q1(PK, d1, d2, d3, d4, d5) as
(
  select PK, dTwo, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dFive
  from @T
),
q2 as
(
  select PK,
         row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
  from q1
),
q3 as
(
  select PK
  from q2
  where rn = 1
  group by PK
  having count(*) = 6   
)
select T.*
from @T as T
  inner join q3 as Q
    on T.PK = Q.PK

回复收藏 0 原文

能怎样 2024-12-08 00:15:10

我无法对性能做出任何承诺，但你可以尝试一下。我做的第一件事是将数据放入更规范化的结构中。

CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO

INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO

SELECT DISTINCT
    T2.my_id
FROM
    (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
    (
    SELECT
        COUNT(*)
    FROM
        dbo.Test_Sets_Normalized T3
    INNER JOIN dbo.Test_Sets_Normalized T4 ON
        T4.my_id = T2.my_id AND
        T4.c = T3.c
    WHERE
        T3.my_id = T1.my_id) >= 5

这应该会为您提供所需的 ID。一旦您确认它符合您的要求，您就可以 JOIN 返回原始表并按 ID 进行删除。

可能有不需要 DISTINCT 的地方可以进行改进。我会再考虑一下。

I can't make any promises on performance, but you can try this. The first thing that I do is put the data into a more normalized structure.

CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO

INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO

SELECT DISTINCT
    T2.my_id
FROM
    (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
    (
    SELECT
        COUNT(*)
    FROM
        dbo.Test_Sets_Normalized T3
    INNER JOIN dbo.Test_Sets_Normalized T4 ON
        T4.my_id = T2.my_id AND
        T4.c = T3.c
    WHERE
        T3.my_id = T1.my_id) >= 5

That should get you the IDs that you need. Once you've confirmed that it does what you want, you can JOIN back to the original table and delete by IDs.

There's probably an improvement possible somewhere that doesn't require the DISTINCT. I'll give it a little more thought.

回复收藏 0 原文

帅冕 2024-12-08 00:15:10

编辑 - 以下方法可能比 N 平方性能更好，具体取决于优化器。如果所有 5 列都建立了索引，则每行只需要 6 次索引查找，这仍然是 N * logN。不过，它看起来确实有点傻。

您可以根据 5 个匹配的所有排列编写代码生成 where 条件：因此要删除的记录将由以下给出：

SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
    SELECT PK From SillyTable Duplicate 
    WHERE (   (
            (Duplicate.dOne=ToDelete.dOne) 
            AND (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree)
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
        ) OR (
            (Duplicate.dOne=ToDelete.dTwo) 
            AND (Duplicate.dTwo=ToDelete.dThree) 
            AND (Duplicate.dThree=ToDelete.dFour)
            AND (Duplicate.dFour=ToDelete.dFive)
            AND (Duplicate.dFive=ToDelete.dSix)
        ) OR (
            (Duplicate.dTwo=ToDelete.dOne) 
            AND (Duplicate.dThree=ToDelete.dTwo) 
            AND (Duplicate.dFour=ToDelete.dThree)
            AND (Duplicate.dFive=ToDelete.dFour)
            AND (Duplicate.dSix=ToDelete.dFive)
        ) OR (
            (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree) 
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
            AND (Duplicate.dSix=ToDelete.dSix)
        ) ...

这将继续覆盖所有 36 个组合（连接的每一侧都有一个不匹配，超出6 个可能的列，因此 6*6 为您提供了所有可能性）。我会编码生成这个，因为它需要大量输入，如果你明天想要 6 场比赛中的 4 场比赛怎么办，但我想你可以手动编码。

Edit - the following approach might be better than N squared performance, depending on the optimizer. If all 5 columns are indexed it should only need 6 index seeks per row, which is still N * logN. It does seem a little dopey though.

You could code generate the where condition based on all the permutations of 5 matches: so the records to delete would be given by:

SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
    SELECT PK From SillyTable Duplicate 
    WHERE (   (
            (Duplicate.dOne=ToDelete.dOne) 
            AND (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree)
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
        ) OR (
            (Duplicate.dOne=ToDelete.dTwo) 
            AND (Duplicate.dTwo=ToDelete.dThree) 
            AND (Duplicate.dThree=ToDelete.dFour)
            AND (Duplicate.dFour=ToDelete.dFive)
            AND (Duplicate.dFive=ToDelete.dSix)
        ) OR (
            (Duplicate.dTwo=ToDelete.dOne) 
            AND (Duplicate.dThree=ToDelete.dTwo) 
            AND (Duplicate.dFour=ToDelete.dThree)
            AND (Duplicate.dFive=ToDelete.dFour)
            AND (Duplicate.dSix=ToDelete.dFive)
        ) OR (
            (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree) 
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
            AND (Duplicate.dSix=ToDelete.dSix)
        ) ...

This goes on to cover all 36 combinations (there is one non-match on each side of the join, out of 6 possible columns, so 6*6 gives you all the possibilites). I would code generate this because it's a lot of typing, and what if you want 4 out of 6 matches tomorrow, but you could hand code it I guess.

回复收藏 0 原文

~没有更多了~