Finding correlations between lookup tables

Posted on 2024-12-14 02:50:34


I have these two tables:

TableA
ID  Opt1    Opt2    Type
1   A       Z       10
2   B       Y       20
3   C       Z       30
4   C       K       40

and

TableB
ID  Opt1    Type
1   Z       57
2   Z       99
3   X       3000
4   Z       3000

What would be a good algorithm for finding arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records with Opt1 = C in TableA and records with Type = 3000 in TableB.

I could imagine using Apriori in some way, but that doesn't seem very practical. What do you think?

Thanks.


半步萧音过轻尘 2024-12-21 02:50:34

This sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/


夏日落 2024-12-21 02:50:34


In pseudocode, a naive implementation might look like:

  for each column c1 in table1
     for each column c2 in table2
        if approximately_isomorphic(c1, c2) then
           emit (c1, c2)

  approximately_isomorphic(c1, c2)
     pairs = hash_set()
     for i = 1 to min(|c1|, |c2|) do
        add (c1[i], c2[i]) to pairs
     if |pairs| - unique_count(c1) <= error_margin then return true
     else return false

The idea is this: compare every column of the first table with every column of the second. For each pair of columns, collect the distinct links between corresponding elements, that is, the distinct (c1[i], c2[i]) pairs. The links are kept as a set of pairs rather than as a map keyed on c1[i], since such a map would hold exactly one entry per unique value no matter how the columns relate. If the number of links equals the number of unique elements in the first column, you have a perfect isomorphism; if there are a few more, you have a near isomorphism; if there are many more, up to the number of elements in the first column, the columns probably aren't correlated.
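A minimal Python sketch of the check described above, run on the two tables from the question. The names `mapping_stats` and `related_columns`, the dict-of-lists table layout, and the default `error_margin=0` are assumptions made for illustration, not part of the original pseudocode.

```python
from itertools import product


def mapping_stats(c1, c2):
    """Return (number of distinct links, number of unique c1 values)."""
    n = min(len(c1), len(c2))
    links = set(zip(c1[:n], c2[:n]))  # distinct (c1[i], c2[i]) pairs
    return len(links), len(set(c1[:n]))


def approximately_isomorphic(c1, c2, error_margin=0):
    """True when c1 maps onto c2 with at most error_margin extra links."""
    links, uniques = mapping_stats(c1, c2)
    return links - uniques <= error_margin


def related_columns(table1, table2, error_margin=0):
    """Compare every column of table1 with every column of table2."""
    return [
        (name1, name2)
        for (name1, c1), (name2, c2) in product(table1.items(), table2.items())
        if approximately_isomorphic(c1, c2, error_margin)
    ]


# The tables from the question, stored column by column (assumed layout).
table_a = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["A", "B", "C", "C"],
    "Opt2": ["Z", "Y", "Z", "K"],
    "Type": [10, 20, 30, 40],
}
table_b = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["Z", "Z", "X", "Z"],
    "Type": [57, 99, 3000, 3000],
}

print(mapping_stats(table_a["Opt1"], table_b["Type"]))  # (3, 3)
print(mapping_stats(table_a["Opt1"], table_b["ID"]))    # (4, 3)
print(related_columns(table_a, table_b))
```

Raising `error_margin` to 1 would also accept the near isomorphisms, such as Opt2 & Type.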

Example on your input:

  ID & anything  : perfect isomorphism since all ID values are unique

  Opt1 & ID      : 4 mappings and 3 unique values; not a perfect
                  isomorphism, but not too far away.
  Opt1 & Opt1    : ditto above
  Opt1 & Type    : 3 mappings & 3 unique values, perfect isomorphism

  Opt2 & ID      : 4 mappings & 3 unique values, not a perfect
                  isomorphism, but not too far away
  Opt2 & Opt1    : ditto above
  Opt2 & Type    : ditto above

  Type & anything: perfect isomorphism since all Type values are unique

For best results, you might run this procedure both ways, that is, comparing table1 to table2 and then table2 to table1, to look for bijective mappings. Otherwise, you can be thrown off by trivial cases: all values in the first column are different (perfect isomorphism) or all values in the second column are the same (perfect isomorphism). Note also that this technique gives you a way of ranking, or measuring, how similar or dissimilar two columns are.
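One possible way to turn the two-way check into a ranking is sketched below. The score used here (unique values divided by distinct links, taken in both directions, keeping the smaller ratio) is an assumption chosen for illustration; the answer does not prescribe a particular formula. The function names and the dict-of-lists table layout are likewise hypothetical.

```python
def two_way_score(c1, c2):
    """Score in (0, 1]; 1.0 means the columns map onto each other bijectively."""
    n = min(len(c1), len(c2))
    a, b = c1[:n], c2[:n]
    links = len(set(zip(a, b)))  # distinct (a[i], b[i]) pairs
    if links == 0:
        return 0.0
    forward = len(set(a)) / links   # 1.0 when each a value has one b value
    backward = len(set(b)) / links  # 1.0 when each b value has one a value
    return min(forward, backward)


def rank_column_pairs(table1, table2):
    """Return (score, column1, column2) triples, best score first."""
    scored = [
        (two_way_score(c1, c2), name1, name2)
        for name1, c1 in table1.items()
        for name2, c2 in table2.items()
    ]
    return sorted(scored, key=lambda triple: -triple[0])


# The tables from the question, stored column by column (assumed layout).
table_a = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["A", "B", "C", "C"],
    "Opt2": ["Z", "Y", "Z", "K"],
    "Type": [10, 20, 30, 40],
}
table_b = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["Z", "Z", "X", "Z"],
    "Type": [57, 99, 3000, 3000],
}

for score, name1, name2 in rank_column_pairs(table_a, table_b):
    print(f"{name1} & {name2}: {score:.2f}")
```

On these tables the pairs scoring 1.0 are ID & ID, Opt1 & Type, and Type & ID. The first and last are the trivial case where both columns hold only distinct values, so a real tool would probably want to filter out such key-like columns.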

Is this going in the right direction? By the way, this is O(ijk), where table1 has i columns, table2 has j columns, and each column has k elements. In theory, the best a method could do is O(ik + jk), if it can find correlations without doing pairwise comparisons.
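As a rough illustration of how the pairwise comparisons could be avoided, the sketch below handles only the exact bijective case; it does not find near isomorphisms. Each column is reduced to a signature that records the order in which its distinct values first appear. Two columns of equal length are related by a bijection exactly when their signatures match, so the columns of table2 can be indexed by signature and looked up in expected O(ik + jk) time, plus the size of the output. The helper names and the table layout are assumptions.

```python
from collections import defaultdict


def signature(column):
    """Replace each value by the rank of its first occurrence."""
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen)) for v in column)


def bijective_pairs(table1, table2):
    """Find column pairs with identical signatures without comparing all pairs."""
    by_signature = defaultdict(list)
    for name2, c2 in table2.items():
        by_signature[signature(c2)].append(name2)
    return [
        (name1, name2)
        for name1, c1 in table1.items()
        for name2 in by_signature.get(signature(c1), [])
    ]


# The tables from the question, stored column by column (assumed layout).
table_a = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["A", "B", "C", "C"],
    "Opt2": ["Z", "Y", "Z", "K"],
    "Type": [10, 20, 30, 40],
}
table_b = {
    "ID": [1, 2, 3, 4],
    "Opt1": ["Z", "Z", "X", "Z"],
    "Type": [57, 99, 3000, 3000],
}

print(signature(table_a["Opt1"]))  # (0, 1, 2, 2)
print(bijective_pairs(table_a, table_b))
# [('ID', 'ID'), ('Opt1', 'Type'), ('Type', 'ID')]
```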
