How to prune duplicate associations to yield a unique, most-complete set

Posted on 2024-10-21 17:18:16

I hardly know how to state this question, let alone search for answers. But here's my best shot. Assume I have a table:

Col1   Col2
-----+-----
 A   | 1
 A   | 2
 A   | 3
 A   | 4
 B   | 1
 B   | 2
 B   | 3
 C   | 1
 C   | 2
 C   | 3
 D   | 1

I want to find the subset of associations (rows) where:

  1. There are no duplicates in Col1
  2. There are no duplicates in Col2
  3. Every value in Col1 is associated with a value in Col2

So the above example could yield this result

Col1   Col2
-----+-----
 A   | 4
 B   | 2
 C   | 3
 D   | 1

Notice that A-4 must be in the result: there are 4 unique letters and 4 unique numbers, so every number has to be used, and 4 appears only with A. If you don't associate A with 4, no remaining subset maps every value in Col1 while retaining the uniqueness of Col2.

Also, notice that it would be equally valid to replace B-2 and C-3 with B-3 and C-2. I don't care which subset is selected, but I want one that fulfills all the requirements.

Not every set of data will have a subset that fulfills all the requirements, but I want to get as close as possible, that is, to associate as many Col1 values as I can.
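To make the three requirements concrete, here is a minimal Python sketch that checks a candidate subset against them. The function name `check_selection` and the tuple representation of rows are assumptions made for illustration; they are not part of the original question.

```python
from collections import Counter

def check_selection(rows, chosen):
    """Return (col1_unique, col2_unique, covers_all_col1) for a candidate subset.

    rows   -- every (col1, col2) pair in the table
    chosen -- the candidate subset, assumed to be drawn from rows
    """
    col1_counts = Counter(c1 for c1, _ in chosen)
    col2_counts = Counter(c2 for _, c2 in chosen)
    col1_unique = all(n == 1 for n in col1_counts.values())   # requirement 1
    col2_unique = all(n == 1 for n in col2_counts.values())   # requirement 2
    covers_all = {c1 for c1, _ in rows} == set(col1_counts)   # requirement 3
    return col1_unique, col2_unique, covers_all

rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]

print(check_selection(rows, [("A", 4), ("B", 2), ("C", 3), ("D", 1)]))  # (True, True, True)
print(check_selection(rows, [("A", 1), ("B", 2), ("C", 3), ("D", 1)]))  # (True, False, True)
```

The second call fails requirement 2 because both A and D claim the value 1.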

I'm trying to do this with a SQL query. I had a query that seemed to accomplish this for one set of data, but then I had to rewrite it for a slightly different set (where Col2 is actually a pair of columns) and could not reproduce my earlier success. My first solution used Min() and Group By, plus a couple of joins on the aggregated results, to mark duplicates for elimination in a loop until there was nothing left to safely eliminate. My more recent solution replaces the Group By queries with ROW_NUMBER() expressions that use PARTITION BY. But I can't figure out how to handle the cases where there are multiple valid result sets from multiply-cross-linked pairs like B and C in the above example. My earlier query might have handled it, but I can't quite comprehend what I did (must have had a good day when I wrote that one). Perhaps I need to do a JOIN on the ROW_NUMBER expressions in my sub-queries? My brain gave out for today. I hope someone can help me find an ingeniously simple solution.

Comments (4)

叫思念不要吵 2024-10-28 17:18:16

The problem is equivalent to finding a maximum matching in a bipartite graph. Each distinct value in either column is a vertex, and each row is an edge between a Col1 vertex and a Col2 vertex. The linked Wikipedia article provides some pointers to algorithms for solving this problem. There is an implementation of the Hungarian algorithm in Google's or-tools library.

Here's the given example formulated as a graph, with the red edges representing the given solution:

[Image: the example drawn as a bipartite graph]

It would be surprising to me if you could find a solution purely in SQL.
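For readers who want to try the maximum-matching formulation outside SQL, here is a minimal Python sketch using a simple augmenting-path search (often called Kuhn's algorithm). It is not the Hungarian algorithm and does not use or-tools; the function name `maximum_matching` and the tuple representation of rows are assumptions made for illustration.

```python
def maximum_matching(rows):
    """Return a dict {col1: col2} describing a maximum bipartite matching.

    rows -- iterable of (col1, col2) pairs; each pair is one edge.
    """
    candidates = {}
    for c1, c2 in rows:
        candidates.setdefault(c1, []).append(c2)

    owner = {}  # col2 value -> col1 value currently holding it

    def try_assign(c1, visited):
        # Depth-first search for an augmenting path starting at c1.
        for c2 in candidates[c1]:
            if c2 in visited:
                continue
            visited.add(c2)
            # Take c2 if it is free, or if its current owner can move elsewhere.
            if c2 not in owner or try_assign(owner[c2], visited):
                owner[c2] = c1
                return True
        return False

    for c1 in candidates:
        try_assign(c1, set())

    return {c1: c2 for c2, c1 in owner.items()}

rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]

for col1, col2 in sorted(maximum_matching(rows).items()):
    print(col1, col2)
```

On the example data every letter gets a number, with A paired with 4 and D paired with 1, since those two pairs are forced. When no complete assignment exists, the returned dict simply omits the Col1 values that could not be matched. The search is recursive, so very long reassignment chains would need an iterative rewrite.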

耳根太软 2024-10-28 17:18:16

Try this query. It's not great for a huge data set, but it does what you want. If there is a value in Col1 for which it cannot find a unique Col2 value, it puts in a hardcoded 0; change that to any value you like to indicate the absence of a unique value. I used a table named testing (col1, col2); substitute your own table name for testing.

This is a greedy algorithm that tries to maximize the chance of associating each value in Col1 with a distinct value in Col2. The steps are as follows.

  1. Retrieve the Col1 values in ascending order of the number of Col2 values each is associated with.
  2. Start with the Col1 value that has the fewest Col2 values and assign it one (start with D, since it is associated with only one value).
  3. Go to the next unassigned Col1 value (B or C, since each has 3 values) and assign it any Col2 value that has not already been assigned (1 is already assigned to D, so 2 or 3).
  4. Repeat step 3 for all remaining values in the list from step 1.
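The steps above can be sketched in a few lines of Python, which makes it easy to experiment with inputs. This is an illustration, not a translation of the T-SQL: the function name `greedy_assign`, the tie-breaking by name, and the choice of the smallest free Col2 value are all assumptions of the sketch, since the steps leave those choices open.

```python
def greedy_assign(rows, missing=0):
    """Greedily give each col1 value a col2 value nobody else has taken.

    Returns {col1: col2}, using `missing` where no free col2 value is left.
    """
    candidates = {}
    for c1, c2 in rows:
        candidates.setdefault(c1, []).append(c2)

    taken = set()
    result = {}
    # Step 1: most constrained col1 values first (ties broken by name here).
    for c1 in sorted(candidates, key=lambda k: (len(candidates[k]), k)):
        free = [c2 for c2 in candidates[c1] if c2 not in taken]
        if free:
            # Any free value would do; min() just keeps the sketch deterministic.
            result[c1] = min(free)
            taken.add(result[c1])
        else:
            result[c1] = missing
    return result

rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]
print(greedy_assign(rows))    # {'D': 1, 'B': 2, 'C': 3, 'A': 4}

tricky = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 1), ("C", 2)]
print(greedy_assign(tricky))  # {'A': 1, 'B': 2, 'C': 0}
```

The second input shows the limit of a greedy pass: a full assignment exists (A-1, C-2, B-3), but with these tie-breaking choices C is left without a value, because a greedy pass never revisits an earlier choice.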

The following code implements this algorithm; it is not an optimal implementation.

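DECLARE @COUNTER INT = 1;
DECLARE @MAX     INT = 0;

-- One row per Col1 value; ROWNUM records the processing order.
DECLARE @TEMPTABLE TABLE
(
    ROWNUM  INT     IDENTITY(1,1)
    ,COL1   CHAR(1)
    ,COL2   INT     -- 0 means "no unique Col2 value found"
);

-- Fewest candidate Col2 values first, so the most constrained
-- Col1 values are assigned first.
INSERT INTO @TEMPTABLE (COL1, COL2)
SELECT COL1, 0
FROM testing
GROUP BY COL1
ORDER BY COUNT(COL2);

SELECT @MAX = MAX(ROWNUM) FROM @TEMPTABLE;

WHILE (@COUNTER <= @MAX)
BEGIN
    -- Give the current Col1 value a Col2 value that is not already taken.
    -- If several candidates qualify, which one is used is not deterministic.
    UPDATE TT
    SET COL2 = T.COL2
    FROM testing T
    INNER JOIN @TEMPTABLE TT
        ON T.COL1 = TT.COL1
    WHERE T.COL2 NOT IN (SELECT COL2 FROM @TEMPTABLE)
      AND TT.ROWNUM = @COUNTER;

    SET @COUNTER = @COUNTER + 1;
END

SELECT COL1, COL2 FROM @TEMPTABLE;

热情消退 2024-10-28 17:18:16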

It seems to me that you're aiming for something that SQL is not strong enough for. This is a non-standard algorithmic task, and I think you need a real programming language to achieve it. Your task reminds me of chess riddles.

软糯酥胸 2024-10-28 17:18:16

This seems to do the trick (I will review the other answers and compare after posting):

CREATE TABLE Trial(Col1 nvarchar(5) not null, Col2 int not null, Eliminated bit not null)

INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 1, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 2, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 3, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 4, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('B', 1, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('B', 2, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('B', 3, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('C', 1, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('C', 2, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('C', 3, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('D', 1, 0)

-- Pass 1: eliminate a row when its Col2 value appears in more live rows
-- than its Col1 value does, provided that Col1 value still has other live rows.
UPDATE T0 SET Eliminated = 1
FROM Trial T0
JOIN (
   SELECT Col1, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col1) T1
   ON T0.Col1 = T1.Col1
JOIN (
   SELECT Col2, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col2) T2
   ON T2.Col2 = T0.Col2
WHERE T2.Dups > T1.Dups AND T1.Dups > 1

-- Pass 2: the mirror image of pass 1, with the roles of Col1 and Col2 swapped.
UPDATE T0 SET Eliminated = 1
FROM Trial T0
JOIN (
   SELECT Col1, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col1) T1
   ON T0.Col1 = T1.Col1
JOIN (
   SELECT Col2, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col2) T2
   ON T2.Col2 = T0.Col2
WHERE T1.Dups > T2.Dups AND T2.Dups > 1

-- Pass 3: among the remaining rows, keep only those whose rank within their
-- Col1 group equals their rank within their Col2 group.
UPDATE T0 SET Eliminated = 1
FROM Trial T0
JOIN (
   SELECT Col1, Col2, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col2) Dup
   FROM Trial
   WHERE Eliminated = 0) T1 ON T1.Col1 = T0.Col1 AND T1.Col2 = T0.Col2
JOIN (
   SELECT Col1, Col2, ROW_NUMBER() OVER (PARTITION BY Col2 ORDER BY Col1) Dup
   FROM Trial
   WHERE Eliminated = 0) T2 ON T2.Col1 = T0.Col1 AND T2.Col2 = T0.Col2
WHERE T1.Dup <> T2.Dup

It may not be perfect, but seems to work on my data.
