连接与子查询
我是一个老派的 MySQL 用户,并且总是更喜欢 JOIN
而不是子查询。但现在每个人都使用子查询,我讨厌它;我不知道为什么。
我缺乏理论知识来自行判断是否有任何差异。子查询是否与 JOIN 一样好,因此无需担心?
I am an old-school MySQL user and have always preferred JOIN
over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.
I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a JOIN
and therefore is there nothing to worry about?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(20)
仅当第二个连接表的数据明显多于主表时,才会看到差异。我有过如下的经历...
我们有一个包含 10 万个条目的用户表,以及大约 30 万个条目的会员数据(友谊)。这是一个 join 语句,目的是获取朋友及其数据,但延迟很大。但当成员资格表中只有少量数据时,它工作得很好。一旦我们将其更改为使用子查询,它就可以正常工作。
但与此同时,连接查询正在处理条目数少于主表的其他表。
所以我认为连接和子查询语句工作正常,这取决于数据和情况。
The difference is only seen when the second joining table has significantly more data than the primary table. I had an experience like below...
We had a users table of one hundred thousand entries and their membership data (friendship) about 3 hundred thousand entries. It was a join statement in order to take friends and their data, but with a great delay. But it was working fine where there was only a small amount of data in the membership table. Once we changed it to use a sub-query it worked fine.
But in the mean time the join queries are working with other tables that have fewer entries than the primary table.
So I think the join and sub query statements are working fine and it depends on the data and the situation.
如今,许多数据库都可以优化子查询和连接。因此,您只需使用解释检查您的查询,看看哪个更快。如果性能没有太大差异,我更喜欢使用子查询,因为它们简单且易于理解。
These days, many dbs can optimize subqueries and joins. Thus, you just gotto examine your query using explain and see which one is faster. If there is not much difference in performance, I prefer to use subquery as they are simple and easier to understand.
我不是关系数据库专家,所以对此持保留态度。
关于子查询与连接的总体思路是评估较大查询所采用的路径。
为了执行较大的查询,必须首先执行每个单独的子查询,然后将结果集存储为较大查询与之交互的临时表。
该临时表没有索引,因此任何比较都需要扫描整个结果集。
相反,当您使用联接时,所有索引都在使用中,因此比较需要遍历索引树(或哈希表),这在速度方面要便宜得多。
现在,我不知道最流行的关系引擎的较新版本是否执行反向求值,并仅在临时表中加载必要的元素,作为一种优化方法。
I am not a relational database expert, so take this with a grain of salt.
The general idea about sub queries vs joins is the path the evaluation of the larger query takes.
In order to perform the larger query, every individual subquery has to be executed first, and then the resultset is stored as a temporary table that the larger query interacts with.
This temporary table is unindexed, so, any comparison requires scanning the whole resultset.
In contrast, when you use a join, all indexes are in use and so, comparison require traversing index trees (or hash tables), which is way less expensive in terms of speed.
Now, what I don't know if newer versions of the most popular relational engines execute the evaluation on reverse, and just load the necessary elements in the temporary table, as an optimization method.
我只是在考虑同样的问题,但我在 FROM 部分使用子查询。
我需要从大表进行连接和查询,“从”表有2800万条记录,但结果只有128,这么小的结果大数据!我正在使用 MAX() 函数。
首先我使用 LEFT JOIN 因为我认为这是正确的方法,mysql 可以优化等。
第二次只是为了测试,我重写为针对 JOIN 进行子选择。
LEFT JOIN 运行时间:1.12s
SUB-SELECT 运行时间:0.06s
子选择比连接快 18 倍!就在 chokito adv 中。子选择看起来很糟糕,但结果......
I just thinking about the same problem, but I am using subquery in the FROM part.
I need connect and query from large tables, the "slave" table have 28 million record but the result is only 128 so small result big data! I am using MAX() function on it.
First I am using LEFT JOIN because I think that is the correct way, the mysql can optimalize etc.
Second time just for testing, I rewrite to sub-select against the JOIN.
LEFT JOIN runtime: 1.12s
SUB-SELECT runtime: 0.06s
18 times faster the subselect than the join! Just in the chokito adv. The subselect looks terrible but the result ...
这取决于几个因素,包括您正在运行的特定查询、数据库中的数据量。子查询首先运行内部查询,然后再次从结果集中过滤出实际结果。而在 join 中则一次性运行并产生结果。
最好的策略是您应该测试连接解决方案和子查询解决方案以获得优化的解决方案。
It depends on several factors, including the specific query you're running, the amount of data in your database. Subquery runs the internal queries first and then from the result set again filter out the actual results. Whereas in join runs the and produces the result in one go.
The best strategy is that you should test both the join solution and the subquery solution to get the optimized solution.
如果你想使用 join 加速查询:
对于“inner join/join”,
不要使用 where 条件,而是在“ON”条件下使用它。
例如:
对于“左/右连接”,
不要在“ON”条件下使用,因为如果您使用左/右连接,它将获取任何一个表的所有行。因此,在“On”条件下使用它是没有用的。所以,尝试使用“Where”条件
If you want to speed up your query using join:
For "inner join/join",
Don't use where condition instead use it in "ON" condition.
Eg:
For "Left/Right Join",
Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition
子查询是解决“从 A 获取事实,以 B 的事实为条件”形式的问题的逻辑上正确的方法。在这种情况下,将 B 放在子查询中比进行联接更具逻辑意义。从实际意义上讲,它也更安全,因为您不必因为与 B 进行多场比赛而担心从 A 获取重复的事实。
然而,实际上,答案通常归结为性能。有些优化器在给定连接和子查询时会很糟糕,而有些优化器则相反,这是特定于优化器、特定于 DBMS 版本和特定于查询的。
从历史上看,显式联接通常会获胜,因此公认的观点是联接更好,但优化器一直在变得更好,因此我更喜欢首先以逻辑连贯的方式编写查询,然后在性能限制允许的情况下进行重组。
Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.
Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.
Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.
在大多数情况下,
JOIN
比子查询更快,而子查询更快的情况很少见。在 JOIN 中,RDBMS 可以创建一个更适合您的查询的执行计划,并且可以预测应该加载哪些数据进行处理并节省时间,这与子查询不同,子查询将运行所有查询并加载所有数据进行处理。
子查询的好处是它们比 JOIN 更具可读性:这就是大多数 SQL 新人更喜欢子查询的原因;这是最简单的方法;但就性能而言,JOINS 在大多数情况下更好,尽管它们也不难阅读。
In most cases
JOIN
s are faster than sub-queries and it is very rare for a sub-query to be faster.In
JOIN
s RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed and save time, unlike the sub-query where it will run all the queries and load all their data to do the processing.The good thing in sub-queries is that they are more readable than
JOIN
s: that's why most new SQL people prefer them; it is the easy way; but when it comes to performance, JOINS are better in most cases even though they are not hard to read too.摘自 MySQL 手册 (13.2. 10.11 将子查询重写为连接):
因此子查询可能比 LEFT [OUTER] JOIN 慢,但在我看来,它们的优点是可读性稍高。
Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):
So subqueries can be slower than
LEFT [OUTER] JOIN
, but in my opinion their strength is slightly higher readability.在 2010 年,我会加入这个问题的作者,并会强烈投票支持
JOIN
,但是有了更多的经验(特别是在 MySQL 中),我可以说:是的,子查询可以更好。我在这里阅读了多个答案;一些声明的子查询速度更快,但缺乏很好的解释。我希望我能为一个人提供这个(非常)迟到的答案:首先,让我说最重要的一点:子查询有不同形式
第二个重要的陈述:大小重要事项
如果您使用子查询,您应该了解数据库服务器如何执行子查询。特别是如果子查询计算一次或每行计算一次!
另一方面,现代数据库服务器能够进行很多优化。在某些情况下,子查询有助于优化查询,但较新版本的 DB-Server 可能会使优化变得过时。
选择字段中的子查询
请注意,对于
foo
中的每个结果行都会执行子查询。如果可能的话,避免这种情况;它可能会大大减慢您对大型数据集的查询速度。但是,如果子查询没有对 foo 的引用,则它可以由数据库服务器作为静态内容进行优化,并且只能评估一次。
Where 语句中的子查询
如果幸运的话,数据库会在内部将其优化为
JOIN
。如果没有,您的查询在大型数据集上将变得非常非常慢,因为它将对 foo 中的每一行执行子查询,而不仅仅是像 select-type 中的结果。Join 语句中的子查询
这很有趣。我们将
JOIN
与子查询结合起来。在这里我们得到了子查询的真正优势。想象一下wilco
中有数百万行的数据集,但只有几个不同的me
。我们现在有一个较小的临时表可以连接,而不是连接一个巨大的表。根据数据库大小,这可能会导致查询速度更快。您可以使用CREATE TEMPORARY TABLE ...
和INSERT INTO ... SELECT ...
获得相同的效果,这可能会为非常复杂的查询提供更好的可读性(但可以将数据集锁定在可重复读隔离级别)。嵌套子查询
您可以在多个级别嵌套子查询。如果您必须对结果进行分组或更改,这可以帮助处理庞大的数据集。通常数据库服务器会为此创建一个临时表,但有时您不需要对整个表进行某些操作,只需对结果集进行操作。根据表的大小,这可能会提供更好的性能。
结论
子查询不能替代
JOIN
并且您不应该像这样使用它们(尽管可能)。以我的拙见,子查询的正确使用是用作CREATE TEMPORARY TABLE ...
的快速替换。良好的子查询以JOIN
的ON
语句无法完成的方式减少数据集。如果子查询具有关键字 GROUP BY 或 DISTINCT 之一,并且最好不要位于 select 字段或 where 语句中,那么它可能会大大提高性能。In the year 2010 I would have joined the author of this questions and would have strongly voted for
JOIN
, but with much more experience (especially in MySQL) I can state: Yes subqueries can be better. I've read multiple answers here; some stated subqueries are faster, but it lacked a good explanation. I hope I can provide one with this (very) late answer:First of all, let me say the most important: There are different forms of sub-queries
And the second important statement: Size matters
If you use sub-queries, you should be aware of how the DB-Server executes the sub-query. Especially if the sub-query is evaluated once or for every row!
On the other side, a modern DB-Server is able to optimize a lot. In some cases a subquery helps optimizing a query, but a newer version of the DB-Server might make the optimization obsolete.
Sub-queries in Select-Fields
Be aware that a sub-query is executed for every resulting row from
foo
.Avoid this if possible; it may drastically slow down your query on huge datasets. However, if the sub-query has no reference to
foo
it can be optimized by the DB-server as static content and could be evaluated only once.Sub-queries in the Where-statement
If you are lucky, the DB optimizes this internally into a
JOIN
. If not, your query will become very, very slow on huge datasets because it will execute the sub-query for every row infoo
, not just the results like in the select-type.Sub-queries in the Join-statement
This is interesting. We combine
JOIN
with a sub-query. And here we get the real strength of sub-queries. Imagine a dataset with millions of rows inwilco
but only a few distinctme
. Instead of joining against a huge table, we have now a smaller temporary table to join against. This can result in much faster queries depending on database size. You can have the same effect withCREATE TEMPORARY TABLE ...
andINSERT INTO ... SELECT ...
, which might provide better readability on very complex queries (but can lock datasets in a repeatable read isolation level).Nested sub-queries
You can nest sub-queries in multiple levels. This can help on huge datasets if you have to group or change the results. Usually the DB-Server creates a temporary table for this, but sometimes you do not need some operations on the whole table, only on the resultset. This might provide a much better performance depending on the size of the table.
Conclusion
Sub-queries are no replacement for a
JOIN
and you should not use them like this (although possible). In my humble opinion, the correct use of a sub-query is the use as a quick replacement ofCREATE TEMPORARY TABLE ...
. A good sub-query reduces a dataset in a way you cannot accomplish in anON
statement of aJOIN
. If a sub-query has one of the keywordsGROUP BY
orDISTINCT
and is preferably not situated in the select fields or the where statement, then it might improve performance a lot.使用 EXPLAIN 查看数据库如何对数据执行查询。这个答案中有一个巨大的“这取决于”...
当 PostgreSQL 认为其中一个比另一个更快时,它可以将子查询重写为联接或联接到子查询。这一切都取决于数据、索引、相关性、数据量、查询等。
Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...
PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.
首先,要比较两者,您应该将带有子查询的查询区分为:
对于第一类
如今连 mysql 也这么做了。
尽管如此,有时并非如此,但这并不意味着连接总是会获胜——我在 mysql 中使用子查询提高了性能时遇到过这样的情况。 (例如,如果有一些因素阻止 mysql 规划器正确估计成本,并且如果规划器没有看到相同的连接变体和子查询变体,那么子查询可以通过强制特定路径来超越连接)。
结论是,如果您想确定哪一种性能更好,则应该测试连接查询和子查询变体的查询。
对于第二类,比较没有意义,因为这些查询不能使用联接重写,在这些情况下,子查询是完成所需任务的自然方式,您不应该歧视它们。
First of all, to compare the two first you should distinguish queries with subqueries to:
For the first class of queries a good RDBMS will see joins and subqueries as equivalent and will produce same query plans.
These days even mysql does that.
Still, sometimes it does not, but this does not mean that joins will always win - I had cases when using subqueries in mysql improved performance. (For example if there is something preventing mysql planner to correctly estimate the cost and if the planner doesn't see the join-variant and subquery-variant as same then subqueries can outperform the joins by forcing a certain path).
Conclusion is that you should test your queries for both join and subquery variants if you want to be sure which one will perform better.
For the second class the comparison makes no sense as those queries can not be rewritten using joins and in these cases subqueries are natural way to do the required tasks and you should not discriminate against them.
我认为引用的答案中没有充分强调的是重复问题以及特定(使用)案例可能产生的有问题的结果。
(尽管 Marcelo Cantos 确实提到过)
我将引用斯坦福大学 Lagunita SQL 课程中的示例。
学生表
申请表
(向特定大学和专业提出的申请)
让我们尝试查找已申请
CS
专业的学生的 GPA 分数(无论是哪所大学)使用子查询:
此结果集的平均值为:
使用联接:
此结果集的平均值:
很明显,第二次尝试会产生误导我们的用例中的结果是,它会计算平均值的重复项。
同样明显的是,将
distinct
与基于 join 的语句一起使用不会消除问题,因为它会错误地保留出现次数的三分之一。 3.9 分。正确的情况是考虑
3.9
分数出现两 (2) 次,因为我们实际上有两 (2) 名学生具有该分数符合我们的查询条件。在某些情况下,除了任何性能问题之外,子查询似乎是最安全的方法。
I think what has been under-emphasized in the cited answers is the issue of duplicates and problematic results that may arise from specific (use) cases.
(although Marcelo Cantos does mention it)
I will cite the example from Stanford's Lagunita courses on SQL.
Student Table
Apply Table
(applications made to specific universities and majors)
Let's try to find the GPA scores for students that have applied to
CS
major (regardless of the university)Using a subquery:
The average value for this resultset is:
Using a join:
average value for this resultset:
It is obvious that the second attempt yields misleading results in our use case, given that it counts duplicates for the computation of the average value.
It is also evident that usage of
distinct
with the join - based statement will not eliminate the problem, given that it will erroneously keep one out of three occurrences of the3.9
score. The correct case is to account for TWO (2) occurrences of the3.9
score given that we actually have TWO (2) students with that score that comply with our query criteria.It seems that in some cases a sub-query is the safest way to go, besides any performance issues.
MSDN SQL Server 文档显示
因此,如果您需要类似的东西,
请尝试使用 join 代替。在其他情况下,这没有什么区别。
我说:为子查询创建函数可以消除混乱的问题,并允许您为子查询实现额外的逻辑。因此,我建议尽可能为子查询创建函数。
代码混乱是一个大问题,业界几十年来一直在努力避免它。
MSDN Documentation for SQL Server says
so if you need something like
try to use join instead. In other cases, it makes no difference.
I say: Creating functions for subqueries eliminate the problem of cluttter and allows you to implement additional logic to subqueries. So I recommend creating functions for subqueries whenever possible.
Clutter in code is a big problem and the industry has been working on avoiding it for decades.
根据我对两个案例的观察,如果一个表的记录少于 100,000 条,那么连接将会快速运行。
但如果一个表的记录超过 100,000 条,那么子查询是最好的结果。
我有一个表,其中包含我在下面创建的查询中的 500,000 条记录,其结果时间如下
As per my observation like two cases, if a table has less then 100,000 records then the join will work fast.
But in the case that a table has more than 100,000 records then a subquery is best result.
I have one table that has 500,000 records on that I created below query and its result time is like
有人说“某些 RDBMS 可以将子查询重写为连接或将连接重写为子查询,当它认为一个比另一个快。”,但是这一说法适用于简单的情况,当然不适用于带有子查询的复杂查询,这实际上会导致性能问题。
Some people say "some RDBMS can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other.", but this statement applies to simple cases, surely not for complicated queries with subqueries which actually cause a problems in performance.
从旧的 Mambo CMS 在非常大的数据库上运行:
0 秒
~3 秒
EXPLAIN 显示它们检查完全相同的行数,但一次需要 3 秒,一次几乎是即时的。这个故事的寓意是什么?如果性能很重要(什么时候不是?),请尝试多种方法,看看哪一种最快。
并且...
0 秒
再次,相同的结果,相同的检查行数。我的猜测是,DISTINCT mos_content.catid 比 DISTINCT mos_categories.id 花费的时间要长得多。
Run on a very large database from an old Mambo CMS:
0 seconds
~3 seconds
An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.
And...
0 seconds
Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.
子查询通常用于返回单行作为原子值,但它们也可用于使用 IN 关键字将值与多行进行比较。它们几乎可以出现在 SQL 语句中任何有意义的位置,包括目标列表、WHERE 子句等。可以使用简单的子查询作为搜索条件。例如,在一对表之间:
请注意,对子查询的结果使用普通值运算符要求只能返回一个字段。如果您有兴趣检查一组其他值中是否存在单个值,请使用 IN:
这显然与 LEFT-JOIN 不同,在 LEFT-JOIN 中您只想连接表 A 和 B 中的内容,即使连接-condition 在表 B 等中找不到任何匹配的记录。
如果您只是担心速度,则必须检查数据库并编写一个好的查询,看看性能是否有任何显着差异。
Subqueries are generally used to return a single row as an atomic value, though they may be used to compare values against multiple rows with the IN keyword. They are allowed at nearly any meaningful point in a SQL statement, including the target list, the WHERE clause, and so on. A simple sub-query could be used as a search condition. For example, between a pair of tables:
Note that using a normal value operator on the results of a sub-query requires that only one field must be returned. If you're interested in checking for the existence of a single value within a set of other values, use IN:
This is obviously different from say a LEFT-JOIN where you just want to join stuff from table A and B even if the join-condition doesn't find any matching record in table B, etc.
If you're just worried about speed you'll have to check with your database and write a good query and see if there's any significant difference in performance.
MySQL版本:5.5.28-0ubuntu0.12.04.2-log
我也觉得JOIN总是比MySQL中的子查询好,但是EXPLAIN是更好的判断方式。下面是一个子查询比 JOIN 效果更好的示例。
这是我的带有 3 个子查询的查询:
EXPLAIN 显示:
使用 JOIN 的相同查询是:
输出是:
rows
列的比较可以看出差异,并且使用 JOIN 的查询正在使用使用临时;使用文件排序
。当然,当我运行这两个查询时,第一个查询在 0.02 秒内完成,第二个查询甚至在 1 分钟后也没有完成,所以 EXPLAIN 正确解释了这些查询。
如果我在
list_tag
表上没有 INNER JOIN,即如果我从第一个查询中删除并相应地:
从第二个查询中删除,则 EXPLAIN 为两个查询以及这两个查询返回相同的行数跑得同样快。
MySQL version: 5.5.28-0ubuntu0.12.04.2-log
I was also under the impression that JOIN is always better than a sub-query in MySQL, but EXPLAIN is a better way to make a judgment. Here is an example where sub queries work better than JOINs.
Here is my query with 3 sub-queries:
EXPLAIN shows:
The same query with JOINs is:
and the output is:
A comparison of the
rows
column tells the difference and the query with JOINs is usingUsing temporary; Using filesort
.Of course when I run both the queries, the first one is done in 0.02 secs, the second one does not complete even after 1 min, so EXPLAIN explained these queries properly.
If I do not have the INNER JOIN on the
list_tag
table i.e. if I removefrom the first query and correspondingly:
from the second query, then EXPLAIN returns the same number of rows for both queries and both these queries run equally fast.
子查询能够动态计算聚合函数。
例如,找到该书的最低价格并获取以此价格出售的所有书籍。
1) 使用子查询:
2) 使用 JOIN
Subqueries have ability to calculate aggregation functions on a fly.
E.g. Find minimal price of the book and get all books which are sold with this price.
1) Using Subqueries:
2) using JOINs