How to prune duplicate associations to yield a unique, most-complete set
I hardly know how to state this question, let alone search for answers, but here's my best shot. Assume I have a table:
Col1 | Col2
-----+-----
A    | 1
A    | 2
A    | 3
A    | 4
B    | 1
B    | 2
B    | 3
C    | 1
C    | 2
C    | 3
D    | 1
I want to find the subset of associations (rows) where:
- There are no duplicates in Col1
- There are no duplicates in Col2
- Every value in Col1 is associated with a value in Col2
So the above example could yield this result
Col1 | Col2
-----+-----
A    | 4
B    | 2
C    | 3
D    | 1
Notice that A-4 must be in the result: there are 4 unique letters and 4 unique numbers, and 4 appears only with A. If you don't associate A with 4, no remaining subset maps every value in Col1 while keeping Col2 unique.
Also, notice that it would be equally valid to replace B-2 and C-3 with B-3 and C-2. I don't care which subset is selected, but I want one that fulfills all the requirements.
Not every set of data will have a subset that fulfills all the requirements, but I want to get as close as possible.
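To make the A-4 reasoning concrete, here is a small brute-force sketch in Python. It is only an illustration of the requirements, not a proposed solution: it tries every combination, so it is exponential and only suitable for tiny inputs like the example. The function name `complete_subsets` is made up for this sketch.

```python
from itertools import product

# The example table as (Col1, Col2) pairs.
rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]

def complete_subsets(rows):
    """Yield every subset that uses each Col1 value once and no Col2 value twice."""
    candidates = {}
    for col1, col2 in rows:
        candidates.setdefault(col1, []).append(col2)
    keys = sorted(candidates)
    # Pick one Col2 value per Col1 value; keep picks with no repeated Col2.
    for choice in product(*(candidates[k] for k in keys)):
        if len(set(choice)) == len(choice):
            yield dict(zip(keys, choice))

solutions = list(complete_subsets(rows))
for solution in solutions:
    print(solution)
```

On the example data this finds exactly two complete subsets, and both contain A-4; they differ only in whether B and C take 2 and 3 or 3 and 2.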
I'm trying to do this with a SQL query. I had a query that seemed to accomplish this for one set of data, but then I had to rewrite it for a slightly different set (where Col2 is actually a pair of columns) and could not reproduce my earlier success. My first solution used MIN() and GROUP BY, plus a couple of joins on the aggregated results, to mark duplicates for elimination in a loop until there was nothing left to safely eliminate. My more recent solution replaces the GROUP BY queries with ROW_NUMBER() expressions that use PARTITION BY.
But I can't figure out how to handle the cases where there are multiple valid result sets from multiply-cross-linked pairs like B and C in the above example. My earlier query might have handled it, but I can't quite comprehend what I did (I must have had a good day when I wrote that one). Perhaps I need to do a JOIN on the ROW_NUMBER expressions in my sub-queries? My brain gave out for today. I hope someone can help me find an ingeniously simple solution.
Comments (4)
The problem is equivalent to finding a maximum matching in a bipartite graph. Each distinct value in a column represents a vertex, and each row represents an edge. The linked Wikipedia article provides some pointers to algorithms for solving this problem. There is an implementation of the Hungarian algorithm in Google's OR-Tools library.
Here's the given example formulated as a graph, with the red edges representing the given solution:
It would be surprising to me if you could find a solution purely in SQL.
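Outside SQL, the matching itself is short to write. The following is a minimal Python sketch of the classic augmenting-path approach to maximum bipartite matching (often called Kuhn's algorithm). It is a swapped-in illustration, not the Hungarian algorithm and not the OR-Tools API; the function name `maximum_matching` and the (Col1, Col2) tuple layout are assumptions made for this example.

```python
def maximum_matching(rows):
    """Return a dict Col1 -> Col2 that is a maximum bipartite matching."""
    adjacency = {}
    for col1, col2 in rows:
        adjacency.setdefault(col1, []).append(col2)

    owner = {}  # Col2 value -> Col1 value currently matched to it

    def try_assign(col1, visited):
        # Depth-first search for an augmenting path starting at col1.
        for col2 in adjacency[col1]:
            if col2 in visited:
                continue
            visited.add(col2)
            # Take col2 if it is free, or if its current owner can move elsewhere.
            if col2 not in owner or try_assign(owner[col2], visited):
                owner[col2] = col1
                return True
        return False

    for col1 in adjacency:
        try_assign(col1, set())

    return {col1: col2 for col2, col1 in owner.items()}


rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]
matching = maximum_matching(rows)
print(matching)
```

When no complete assignment exists, the returned dict simply omits the Col1 values that could not be matched, which corresponds to "as close as possible" in the question. The search is recursive, so very long augmenting paths on large inputs would need an iterative rewrite.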
Try this query. It's not great for huge datasets, but it does what you want. If there is a value in col1 for which it cannot find a unique col2, it puts a hardcoded 0; change that to any value you like to indicate the absence of a unique value. I used a table named testing (col1, col2); replace testing with your table name.
This is a greedy algorithm that tries to maximize the chance of associating every value in Col1 with a value in Col2. The steps are as follows.
List item
The following code implements this algorithm; it is not an optimal implementation.
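The query and its steps are not reproduced in this thread, so what follows is not that code. As a stand-in, here is a minimal Python sketch of one possible greedy heuristic in the same spirit. The ordering rules (handle the Col1 values with the fewest candidates first, and prefer the Col2 value that appears in the fewest rows) are assumptions chosen for illustration, as is the function name `greedy_assign`. Like the answer describes, it writes a placeholder 0 when no unique Col2 value is left.

```python
def greedy_assign(rows, missing=0):
    """Greedily pair each Col1 value with an unused Col2 value, or `missing`."""
    candidates, degree = {}, {}
    for col1, col2 in rows:
        candidates.setdefault(col1, []).append(col2)
        degree[col2] = degree.get(col2, 0) + 1  # number of rows containing col2

    used = set()
    result = {}
    # Assumption: most constrained Col1 values (fewest candidates) go first.
    for col1 in sorted(candidates, key=lambda k: (len(candidates[k]), k)):
        free = [c for c in candidates[col1] if c not in used]
        if free:
            # Assumption: prefer the Col2 value that the fewest rows compete for.
            pick = min(free, key=lambda c: (degree[c], c))
            used.add(pick)
            result[col1] = pick
        else:
            result[col1] = missing  # placeholder: no unique Col2 value left
    return result


rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]
assignment = greedy_assign(rows)
print(assignment)
```

A greedy pass like this never revisits an earlier choice, so on some inputs it may leave a value unmatched even though a complete assignment exists.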
It seems to me that you're aiming for something that SQL is not strong enough for. This is a non-standard algorithmic task, and I think you need a real programming language to achieve it. Your task reminds me of chess riddles.
This seems to do the trick (I will review the other answers and compare after posting):
It may not be perfect, but it seems to work on my data.
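One way to gain confidence that a result "works" is to check it mechanically against the requirements in the question. The sketch below is a hypothetical helper (the name `check_result` and the list-of-tuples layout are assumptions), written in Python rather than SQL; it reports each requirement separately so a partial result shows which condition it misses.

```python
def check_result(rows, result):
    """Check a candidate result (list of (Col1, Col2) pairs) against the source rows."""
    allowed = set(rows)
    col1_values = [c1 for c1, _ in result]
    col2_values = [c2 for _, c2 in result]
    return {
        # Every selected pair must be a row of the original table.
        "rows_exist": all(pair in allowed for pair in result),
        # No duplicates in Col1 and none in Col2.
        "col1_unique": len(set(col1_values)) == len(col1_values),
        "col2_unique": len(set(col2_values)) == len(col2_values),
        # Every Col1 value of the table appears in the result.
        "col1_covered": set(col1_values) == {c1 for c1, _ in rows},
    }


rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]
report = check_result(rows, [("A", 4), ("B", 2), ("C", 3), ("D", 1)])
print(report)
```

For data sets where no complete subset exists, `col1_covered` will be false for any valid answer, so in that case the number of rows in the result is the figure to compare between approaches.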