自然连接——关系理论和 SQL
这个问题来自于我对 CJ Date 的SQL 和关系理论:如何编写准确的 SQL 代码的阅读,以及在互联网上查找有关联接的内容(其中包括在此处遇到多篇关于 NATURAL JOIN 的帖子(以及关于 SQL服务器缺乏对它的支持))
所以这是我的问题...
一方面,在关系理论中,自然连接是唯一应该发生的连接(或者至少是高度首选的)。
另一方面,在 SQL 中,建议不要使用 NATURAL JOIN 而是使用替代方法(例如带限制的内连接)。
这些的协调是:
- 自然连接在真正的 RDBMS 中工作。然而,SQL 无法完全再现关系模型,而且流行的 SQL DBMS 都不是真正的 RDBMS。
和/或
- 良好/更好的表设计应该消除/最小化自然连接产生的问题。
?
This question comes from my readings of C.J Date's SQL and Relational Theory: How to Write Accurate SQL Code and looking up about joins on the internet (which includes coming across multiple posts here on NATURAL JOINs (and about SQL Server's lack of support for it))
So here is my problem...
On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).
On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).
Is the reconciliation of these that:
- Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.
and / or
- Good/Better table design should remove/minimise the problems that natural join creates.
?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
关于您的问题的一些要点(即使我担心我没有真正回答您所问的任何问题),
“一方面,在关系理论中,自然连接是唯一应该发生的连接(或者至少是高度首选的连接) )”。
这似乎表明您将理论解释为好像它禁止“其他类型”的连接......事实并非如此。关系理论并没有说“你不能有反连接”,或者“你永远不应该使用反连接”,或者类似的东西。它所说的是,在关系代数中,可以识别一组原始运算符,其中自然连接是唯一的“类连接”运算符。所有其他“类连接”运算符始终可以用定义的原始运算符等效地表示。例如,笛卡尔积是自然连接的一种特殊情况(其中公共属性集为空),并且如果您想要两个表的笛卡尔积,这两个表确实具有共同的属性名称,您可以使用 RENAME 解决此问题。例如,半连接是第一个表与第二个表上的一些投影的自然连接。例如,反连接(Date 书中的 SEMIMINUS 或 NOT MATCHING)是第一个表与两个表的 SEMIJOIN 之间的关系差异。 “另一方面
,在 SQL 中,建议不要使用 NATURAL JOIN,而是使用替代方法(例如带限制的内连接)。”
哪里有这样的建议?在 SQL 标准中?我真的不这么认为。区分 SQL 语言本身(由 ISO 标准定义)和该语言的某些(/任何)特定实现(由某个特定供应商构建)非常重要。如果 Microsoft 建议其客户不要在 SQL Server 200x 中使用 NJ,那么该建议与某人不要在 SQL 中完全使用 NJ 的建议具有完全不同的含义。
“自然连接在真正的 RDBMS 中起作用。然而,SQL 无法完全复制关系模型,而且流行的 SQL DBMS 都不是真正的 RDBMS。”
虽然 SQL 本身确实未能忠实地遵守关系理论,但这实际上与 NJ 的问题没有多大关系。
实现是否为 NJ 调用提供良好的性能,是该实现的特征,而不是语言或“R”的“真实程度”的特征在“关系数据库管理系统”中。构建不使用 SQL 的 TRDBMS 非常容易,但这给 NJ 带来了荒谬的执行时间。 SQL 语言本身具有支持 NJ 所需的一切。如果实现支持NJ,那么NJ也将在该实现中工作。是否提供良好的性能是该实现的一个特征,某些特定实现的较差性能不应该“外推”到其他实现,或者被视为 SQL 语言本身的特征。
“好的/更好的表设计应该消除/最小化自然连接产生的问题。”
自然连接产生的问题?通过在所需的列上添加显式投影(并根据需要重命名),可以轻松控制出现在联接参数中的列。就像您也想尽可能避免 SELECT * 一样,出于基本相同的原因......
a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),
"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."
This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.
"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."
Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.
"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."
While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.
Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.
"Good/Better table design should remove/minimise the problems that natural join creates."
Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...
首先,理论与实践之间的选择是一个谬论。引用克里斯·戴特的话:“事实是,理论——至少是我在这里谈论的理论,即关系理论——确实是非常实用的”。
其次,考虑自然连接依赖于属性命名。请(重新)阅读《准确的 SQL 代码》一书的以下部分:
6.12。对属性名称的依赖。重要引述:
3.9. SQL 中的列命名。重要引述:
我想解决 @kuru kuru pa 的观点(也是一个很好的观点),即将列添加到您无法控制的表中,例如“您正在使用的 Web 服务”。在我看来,使用第 3.9 节中的 Date 建议的策略(上面提到的)可以有效缓解这个问题: quote:
就我个人而言,我发现“自然加入被认为是危险的”态度令人沮丧。不希望听起来自以为是,而是我自己的命名约定,遵循 ISO 11179-5 命名和识别原则,导致模式非常适合自然连接。
遗憾的是,我专业使用的 DBMS 产品(SQL Server)可能不会很快支持自然连接:Microsoft Connect 上的相关功能请求
目前因“无法修复”而关闭,尽管目前得分为+38 / -2已重新开放并获得了可观的 46 / -2 分数
(现在就去投票吧:)
First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".
Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:
6.12. The Reliance on Attribute Names. Salient quote:
3.9. Column Naming in SQL. Salient quote:
I'd like to address @kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above): quote:
Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.
Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect
is currently closed as "won't fix" despite currently having a respectable +38 / -2 scorehas been reopened and gained a respectable 46 / -2 score
(go vote for it now :)
SQL 中 NATURAL JOIN 语法的主要问题是它通常过于冗长。
在教程 D 语法中,我可以非常简单地将自然联接编写为:
但在 SQL 中,SELECT 语句需要派生表子查询或 WHERE 子句和别名来实现相同的效果。这是因为单个“SELECT 语句”实际上是一个非关系型复合运算符,其中组件操作始终按预定顺序发生。投影发生在连接之后,并且连接结果中的列不一定具有唯一的名称。
例如,上面的查询可以用 SQL 编写为:
或:
人们可能更喜欢后一个版本,因为它更短且“更简单”。
The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.
In Tutorial D syntax I can very simply write a natural join as:
But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.
E.g. the above query can be written in SQL as:
or:
People will likely prefer the latter version because it is shorter and "simpler".
理论与现实...
自然连接并不实用。
据我所知,不存在纯粹的(即实践与理论相同)RDBMS。
我认为 Oracle 和其他一些实际上支持自然连接——TSQL 不支持。
考虑一下我们生活的世界 - 两个表各有一个具有相同名称的列的可能性非常高(例如 [name] 或 [id] 或 [date] 等)。也许通过仅对您可能真正想要加入的那些表进行分组,可以稍微缩小这些机会。但无论如何,如果不仔细检查表结构,您将不知道“自然连接”是否是一个好主意。即使是这样,在那一刻,可能不会再过一年,应用程序就会升级,向某些表等添加列,或者您正在使用的 Web 服务添加您不知道的字段, 我认为一个“纯粹”的
系统必须是一个你至少可以 100% 控制的系统,然后也是一个在修改表/创建表过程中进行良好验证的系统,该验证会警告/阻止你在某个表中创建一个新列可能是“自然地”连接到您可能不希望它能够连接的其他表。
我想我的底线是,重视我的理智,希望我的应用程序有最大的正常运行时间,重视快速/干净的维护和升级等。在这种情况下,良好的表设计意味着不使用自然连接(永远) 。
Theory versus reality...
Natural joins are not practical.
There is no such thing as a pure (i.e. practice is idetical to theory) RDBMS, as far as I know.
I think Oracle and a few others actually support support natural joins -- TSQL doesn't.
Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.
I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.
I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).