Finding correlations between lookup tables
I have these two tables:
TableA
ID  Opt1  Opt2  Type
1   A     Z     10
2   B     Y     20
3   C     Z     30
4   C     K     40
and
TableB
ID  Opt1  Type
1   Z     57
2   Z     99
3   X     3000
4   Z     3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records with Opt1 = C in TableA and records with Type = 3000 in TableB.
I could imagine using Apriori in some way, but it doesn't seem very practical. What do you think?
Thanks.
Comments (2)
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains as many links as there are unique elements in the first column, you have a perfect isomorphism; if it contains a few more, you have a near isomorphism; if it contains many more, up to the total number of elements in the first column, the pair probably doesn't represent any correlation.
Example on your input:
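Here is a minimal Python sketch of the idea, run on the two tables from the question. It is not the original pseudocode. It assumes "corresponding elements" means values in the same row position, and the names `count_links` and `column_pair_scores` are made up for illustration. A score of 0 means the first column's values fully determine the second column's values.

```python
from itertools import product

def count_links(col_a, col_b):
    """Return (number of distinct (a, b) links, number of distinct a values).

    Assumption: the two columns are aligned by row position.
    """
    links = set(zip(col_a, col_b))
    return len(links), len(set(col_a))

def column_pair_scores(table_a, table_b):
    """Score every (column of table_a, column of table_b) pair.

    The score is the number of links beyond the unique values of the
    first column: 0 is a perfect mapping, larger values are weaker.
    """
    scores = {}
    for (name_a, col_a), (name_b, col_b) in product(table_a.items(), table_b.items()):
        n_links, n_unique = count_links(col_a, col_b)
        scores[(name_a, name_b)] = n_links - n_unique
    return scores

# The tables from the question, stored column by column.
table_a = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["A", "B", "C", "C"],
    "Opt2": ["Z", "Y", "Z", "K"],
    "Type": [10, 20, 30, 40],
}
table_b = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["Z", "Z", "X", "Z"],
    "Type": [57, 99, 3000, 3000],
}

scores = column_pair_scores(table_a, table_b)
for pair, extra in sorted(scores.items(), key=lambda item: item[1]):
    print(pair, extra)
```

On this input, TableA.Opt1 against TableB.Type scores 0 (C always lines up with 3000), while TableA.Opt2 against TableB.Type scores 1. Pairs that start from TableA.ID or TableA.Type also score 0, but only because every value in those columns is distinct.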
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
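The two-way check could be sketched like this in Python. Again this is only an illustration under the same row-alignment assumption, and `mapping_kind` is a hypothetical helper name, not something from a library.

```python
def mapping_kind(col_a, col_b):
    """Classify how two row-aligned columns map onto each other."""
    links = len(set(zip(col_a, col_b)))
    forward = links == len(set(col_a))   # each value of col_a pairs with one value of col_b
    backward = links == len(set(col_b))  # each value of col_b pairs with one value of col_a
    if forward and backward:
        return "bijective"
    if forward:
        return "forward only"
    if backward:
        return "backward only"
    return "none"

# Columns taken from the question's tables.
a_id = [1, 2, 3, 4]
a_opt1 = ["A", "B", "C", "C"]
a_opt2 = ["Z", "Y", "Z", "K"]
b_type = [57, 99, 3000, 3000]

print(mapping_kind(a_opt1, b_type))  # bijective
print(mapping_kind(a_id, b_type))    # forward only
print(mapping_kind(a_opt2, b_type))  # none
```

Requiring the mapping to hold in both directions drops the trivial TableA.ID to TableB.Type match. Two columns whose values are all distinct (such as the two ID columns) would still come out as bijective, so key-like columns may need to be excluded up front.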
Is this going in the right direction? By the way, this is O(ijk), where table1 has i columns, table2 has j columns, and each column has k elements. In theory, the best any method could do would be O(ik + jk), if it could find correlations without doing pairwise comparisons.