用于发现数据库中事实上的外键的工具?

发布于 2024-11-28 22:18:17 字数 1539 浏览 4 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

短叹 2024-12-05 22:18:17

有趣的问题。您希望解析数据库模式和数据,以确定哪些表相关或应该相互关联,而无需对关系进行任何严格的定义。实际上,您正在尝试推断一种关系。

我认为有两种方法可以推断出这种关系。首先我要说的是,您的方法可能会根据您正在使用的数据库而有所不同。脑海中浮现出许多问题(我不需要答案,但它们值得反思)

  • 这些内部企业系统是否遵循某种一致的命名约定或模式?
  • 或者它们是您随时随地遇到的“野外”数据库?
  • 你准备做出什么样的假设?
  • 您希望结果中有更多误报还是漏报?

请注意,这种类型的推理几乎肯定会给出错误的结果,并且是建立在很多假设之上的。

因此,我提供了两种可以配合使用的方法。

通过结构/命名(符号分析)推断关系

常见的数据库设计是在表名之后命名 PK 列(例如表 CustomerCustomerId >),或者将 PK 列简单命名为 Id

与另一个表具有 FK 关系的表通常将其相关列命名为与相关表相同的名称。在 Order 表中,我期望有一个 CustomerId 列,它引用 CustomerId / Id 列代码>客户表。

这种类型的分析包括

  • 检查表中的列是否有相似的短语/单词,
  • 查找与其他表的名称相似的列名称
  • ,检查包含其他列名称的列名称(例如 FirstCustomerId & ; SecondCustomerId 均引用 Customer 表中的 CustomerId 列)

通过数据推断关系(统计分析)

从数据来看,正如您在评论中建议的那样,您可以确定“可能的”参考资料。如果 Order 表中的 CustomerId 列包含 CustomerId 列中不存在的值表,那么有理由质疑这是一个有效的关系(尽管你永远不知道!)

数据分析的一种简单形式是使用日期和时间。创建的彼此非常接近的行更有可能彼此相关。如果对于创建的每个 Order 行,在几秒钟内还存在创建的 2 到 5 个 Item 行,则两者之间可能存在关系。

更详细的分析可能会关注使用值的范围和分布。

例如,如果您的 Order 表有一个 St_Id 列 - 您可以使用符号分析推断该列可能与 State表或 Status 表。 St_Id 列有 6 个离散值,2 个值覆盖了 90% 的记录。 State 表有 200 行,Status 表有 9 行。您可以相当合理地推断出 St_Id 列与 Status 表相关 - 它提供了对表行的更大覆盖范围(2/3 的行是 '使用',而 State 表中只有 3% 的行会被使用)。

如果您对现有数据库进行数据分析以收集“现实生活数据”,我希望有一些模式可以用作结构推理的指南。当包含大量记录的表中的一列包含多次重复的少量值(不一定按顺序)时,该列更有可能与相应的行数较少的表相关。

总结

祝你好运。这是一个有趣的问题,我刚刚提出了一些想法,但这在很大程度上是一个试验和尝试。错误、数据收集和性能调整情况。

Interesting question. You're looking to parse a database schema and data to determine which tables are relevant or should be related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.

I see two ways that you can infer such a relationship. First let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on)

  • are these in-house enterprise systems that follow some consistent naming convention or pattern?
  • or are they 'in-the-wild' databases that you come across anywhere, at any time?
  • what sort of assumptions are you prepared to make?
  • would you prefer to get more false positives or false negatives in your result?

Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.

So I offer two approachs that I'd use in concert.

Inferring a relationship through structure / naming (symbolic analysis)

Common database design is to name a PK column after the table name (e.g. CustomerId on table Customer), or alternatively name the PK column simply Id.

A table with a FK relationship to another often names its related column the same as the related table. In the Order table I'd expect a CustomerId column which refers to the CustomerId / Id column in the Customer table.

This type of analysis would include

  • inspecting columns across tables for similar phrases / words
  • looking for columns names that are similar to the names of other tables
  • checking for column names that contain the name of other column (e.g. FirstCustomerId & SecondCustomerId both refer to the CustomerId column in the Customer table)

Inferring a relationship through data (statistical analysis)

Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the CustomerId column in the Order table contains values which don't exist in the Id column of the Customer table then it's reasonable to question that this is a valid relationship (although you never know!)

A simple form of data analysis is using dates and times. Rows that were created with close proximity to one another are more likely to be related to one another. If, for every Order row that was created, there also exist between 2 and 5 Item rows created within a few seconds, then a relationship between the two is likely.

A more detailed analysis might look at the range and distribution of used values.

For example, if your Order table has a St_Id column - you might infer using symbolic analysis that the column is likely to relate to either a State table or a Status table. The St_Id column has 6 discrete values, and 90% of the records are covered by 2 values. The State table has 200 rows, and the Status table has 9 rows. You could quite reasonably infer that the St_Id column relates to the Status table - it gives a more greater coverage of the rows of the table (2/3 of the rows are 'used', whereas only 3% of the rows in the State table would be used).

If you perform data analysis on existing databases to gather 'real life data', I'd expect some patterns that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely to this column relates to a table with a correspondingly small number of rows.

In summary

Best of luck. It's an interested problem, I've just thrown some ideas out there but this is very much a trial & error, data gathering and performance tuning situation.

玩物 2024-12-05 22:18:17

在大多数情况下,这是一个不平凡的练习。如果您足够幸运,能够分析现代框架(例如 Ruby on Rails、CakePHP 或类似框架)的模式,并且开发人员严格遵守列约定,那么您有合理的机会找到许多(但不是全部) ,隐含关系。

即,如果您的表使用 user_id 之类的列来引用 users 表中的条目。

注意:某些实体名称可能不规则复数(entity 就是一个很好的例子:entities,而不是 实体),这些更难捕获(但仍然有可能)。但是,无法推断开发人员与 user.id 上的用户表连接的 admin_id 等键。您需要手动处理这些情况。

您没有指定 RDBMS,但我经常使用 MySQL,并且我目前正在为自己解决这个问题。

以下 MySQL 脚本将推断列名称隐含的大多数关系。然后它会列出无法找到表名的所有关系,因此至少您知道丢失了哪些关系。列出推断的父项和子项,以及单数和复数名称,以及隐含的关系:

-- this DB is where MySQL keeps schema information
use information_schema;

-- change this to the DB you want to analyse
set @db_name = "example_DB";

-- infer relationships 
-- NB: this won't catch names that pluralise irregularly like category -> categories or bus_id -> buses etc.
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 )              as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")  as inferred_parent_plural 
,       C.TABLE_NAME                                                  as child_table
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME)-3), "s has many ", C.TABLE_NAME) as inferred_relationship
from    COLUMNS C
JOIN    TABLES T on C.TABLE_NAME = T.TABLE_NAME 
        and C.TABLE_SCHEMA = T.TABLE_SCHEMA 
        and T.TABLE_TYPE != "VIEW"  -- filter out views; comment this line if you want to include them
where   COLUMN_NAME like "%_id"     -- look for columns of the form <name>_id
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables e.g. wordpress tables
-- finally make sure to filter out any inferred names that aren't really tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") -- this is the inferred_parent_plural, but can't use column aliases in the where clause sadly
          in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

这将返回如下结果:
在此处输入图像描述

然后您可以检查检测到的任何命名约定异常:

-- Now list any inferred parents that weren't real tables to see see why (irregular plurals and columns not named according to convention)
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural 
,       C.TABLE_NAME as child_table
from    COLUMNS C
JOIN    TABLES T  on  C.TABLE_NAME    = T.TABLE_NAME 
                  and C.TABLE_SCHEMA  = T.TABLE_SCHEMA 
                  and T.TABLE_TYPE   != "VIEW"            -- filter out views, comment this line if you want to include them
where   COLUMN_NAME like "%_id"
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables e.g. wordpress tables
-- this time only include inferred names that aren't real tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")
          not in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

这将返回如下结果,您可以对其进行处理手动:
在此处输入图像描述

您可以修改这些脚本以输出对您有用的任何内容,包括外键创建语句,如果您想要。在这里,最后一栏是一个简单的“有很多”关系语句。我在我构建的一个名为 pidgin 的工具中使用了它,这是一种快速建模工具,可以根据用非常简单的语法(称为“pidgin”)编写的关系语句动态绘制关系图。您可以在 http://pidgin.gruffdavies.com 上查看

我已经在演示中运行了上述脚本DB 向您展示您可以期待的结果:

在此处输入图像描述

我还没有考虑到不规则复数我的剧本,但我也许也可以尝试一下,至少对于以 -y 结尾的实体的情况。如果您想自己尝试一下,我建议您编写一个存储函数,该函数将 _id 列名称作为参数,并删除 _id 部分然后应用一些启发式方法来尝试正确地进行复数化。

希望这有用!

This is a non-trivial exercise in most cases. If you are lucky enough to be analysing a schema for a modern framework, such as Ruby on Rails, or CakePHP or similar, and the developers have been stringent about following column conventions, then you have a reasonable chance of finding many, but not all, of the implied relationships.

I.e. if your tables use columns like user_id to refer to entries in the users tables.

Be aware: some entity names may pluralise irregularly (entity being a good example: entities, not entitys) and these are harder to catch (but still possible). However, keys such as admin_id which the developers join with the users table on user.id can't be inferred. You would need to handle those cases manually.

You didn't specify an RDBMS, but I used MySQL a lot, and I'm currently working on this problem for myself.

The following MySQL script will infer most relationships implied by column names. It then lists any relationships that it could not find table names for, so at least you know which ones you're missing. The inferred parent and child are listed, along with singular and plural names, plus the implied relationship:

-- this DB is where MySQL keeps schema information
use information_schema;

-- change this to the DB you want to analyse
set @db_name = "example_DB";

-- infer relationships 
-- NB: this won't catch names that pluralise irregularly like category -> categories or bus_id -> buses etc.
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 )              as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")  as inferred_parent_plural 
,       C.TABLE_NAME                                                  as child_table
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME)-3), "s has many ", C.TABLE_NAME) as inferred_relationship
from    COLUMNS C
JOIN    TABLES T on C.TABLE_NAME = T.TABLE_NAME 
        and C.TABLE_SCHEMA = T.TABLE_SCHEMA 
        and T.TABLE_TYPE != "VIEW"  -- filter out views; comment this line if you want to include them
where   COLUMN_NAME like "%_id"     -- look for columns of the form <name>_id
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables e.g. wordpress tables
-- finally make sure to filter out any inferred names that aren't really tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") -- this is the inferred_parent_plural, but can't use column aliases in the where clause sadly
          in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

This will return results like this:
enter image description here

Then you can examine any naming convention exceptions detected with:

-- Now list any inferred parents that weren't real tables to see see why (irregular plurals and columns not named according to convention)
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural 
,       C.TABLE_NAME as child_table
from    COLUMNS C
JOIN    TABLES T  on  C.TABLE_NAME    = T.TABLE_NAME 
                  and C.TABLE_SCHEMA  = T.TABLE_SCHEMA 
                  and T.TABLE_TYPE   != "VIEW"            -- filter out views, comment this line if you want to include them
where   COLUMN_NAME like "%_id"
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables e.g. wordpress tables
-- this time only include inferred names that aren't real tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")
          not in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

This will return results like this, which you can process manually:
enter image description here

You can modify these scripts to spit out whatever is useful to you, include foreign key create statements, if you want to. Here, the final column is a simple 'has many' relationship statement. I use this in a tool I've built called called pidgin, which is rapid modelling tool that draws relationship diagrams on the fly based on relationship statements written a very simple syntax (called 'pidgin'). You can check it out at http://pidgin.gruffdavies.com

I've run the above script on a demo DB to show you the sort of results you can expect:

enter image description here

I haven't catered for the irregular plurals in my script, but I might have a go at that too, at least for the case of entities ending in -y. If you want to have a try at that yourself, I'd recommend writing a stored function that takes <name>_id column names as a parameter, strips the _id part and then applies some heuristics to attempt to pluralise correctly.

Hope that's useful!

追我者格杀勿论 2024-12-05 22:18:17

以下产品均声称提供外键发现功能:

统计方法能够提供一种相似度排名,例如范围分布和创建时间柯克建议,
似乎是正确的方法。
..我需要使用 SAS EG 或任何免费工具来实现它。

The following products are all claiming to provide foreign-keys discovery abilities:

Statistical methodologies able to provide a kind of similarity rank like range distribution and creation time as suggested by Kirk,
seems to be the right way.
.. I'd need to implement it using SAS EG or any free tool.

多彩岁月 2024-12-05 22:18:17

我不知道哪些软件可以帮助您搜索所需的内容,但以下查询将帮助您入门。它列出了当前数据库中的所有外键关系。

SELECT
    K_Table = FK.TABLE_NAME,
    FK_Column = CU.COLUMN_NAME,
    PK_Table = PK.TABLE_NAME,
    PK_Column = PT.COLUMN_NAME,
    Constraint_Name = C.CONSTRAINT_NAME
FROM
    INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
    ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
    ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
    ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
INNER JOIN (
            SELECT
                i1.TABLE_NAME,
                i2.COLUMN_NAME
            FROM
                INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
            INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
                ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
            WHERE
                i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
           ) PT
    ON PT.TABLE_NAME = PK.TABLE_NAME

希望这有帮助。

I don't know about the softwares which may help in searching what you require, but The following query will help to get you started. It lists all Foreign Key Relationships within the current database.

SELECT
    K_Table = FK.TABLE_NAME,
    FK_Column = CU.COLUMN_NAME,
    PK_Table = PK.TABLE_NAME,
    PK_Column = PT.COLUMN_NAME,
    Constraint_Name = C.CONSTRAINT_NAME
FROM
    INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
    ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
    ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
    ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
INNER JOIN (
            SELECT
                i1.TABLE_NAME,
                i2.COLUMN_NAME
            FROM
                INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
            INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
                ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
            WHERE
                i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
           ) PT
    ON PT.TABLE_NAME = PK.TABLE_NAME

Hope this helps.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文