Queries within queries: Is there a better way?
As I build bigger, more advanced web applications, I'm finding myself writing extremely long and complex queries. I tend to write queries within queries a lot because I feel making one call to the database from PHP is better than making several and correlating the data.
However, anyone who knows anything about SQL knows about JOINs. Personally, I've used a JOIN or two before, but quickly stopped when I discovered subqueries, because they felt easier and quicker for me to write and maintain.
Commonly, I'll do subqueries that may contain one or more subqueries from related tables.
Consider this example:
SELECT
    (SELECT username FROM users WHERE records.user_id = user_id) AS username,
    (SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
    in_timestamp,
    out_timestamp
FROM records
ORDER BY in_timestamp
Rarely, I'll do subqueries after the WHERE clause.
Consider this example:
SELECT
    user_id,
    (SELECT name FROM organizations WHERE (SELECT organization FROM locations WHERE records.location = location_id) = organization_id) AS organization_name
FROM records
ORDER BY in_timestamp
In these two cases, would I see any sort of improvement if I decided to rewrite the queries using a JOIN?
As more of a blanket question, what are the advantages/disadvantages of using subqueries or a JOIN? Is one way more correct or accepted than the other?
Comments (4)
In simple cases, the query optimiser should be able to produce identical plans for a simple join versus a simple sub-select.
But in general (and where appropriate), you should favour joins over sub-selects.
Plus, you should avoid correlated subqueries (queries in which the inner expression refers to the outer one), as they are effectively a for loop within a for loop. In most cases a correlated subquery can be written as a join.
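As a rough illustration of that last point, the second query from the question can be expressed with two joins. This is only a sketch: the table layouts and sample rows are assumptions inferred from the column names in the question, and it runs on an in-memory SQLite database through Python's sqlite3 module, so other engines may behave differently.

```python
import sqlite3

# Hypothetical schema and rows, inferred from the question's column names.
# The third record points at location 99, which has no matching row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (organization_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE locations (location_id INTEGER PRIMARY KEY, organization INTEGER);
    CREATE TABLE records (user_id INTEGER, location INTEGER, in_timestamp TEXT);
    INSERT INTO organizations VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO locations VALUES (10, 1), (20, 2);
    INSERT INTO records VALUES
        (1, 10, '2024-01-01 08:00'),
        (2, 20, '2024-01-01 09:00'),
        (3, 99, '2024-01-01 10:00');
""")

# The nested correlated subqueries from the question.
subquery_sql = """
    SELECT user_id,
           (SELECT name FROM organizations
             WHERE (SELECT organization FROM locations
                     WHERE records.location = location_id) = organization_id
           ) AS organization_name
      FROM records
     ORDER BY in_timestamp
"""

# The same lookup written with joins. LEFT JOIN keeps records that have no
# matching location or organization, just as the scalar subquery yields NULL.
join_sql = """
    SELECT r.user_id, o.name AS organization_name
      FROM records r
      LEFT JOIN locations l ON l.location_id = r.location
      LEFT JOIN organizations o ON o.organization_id = l.organization
     ORDER BY r.in_timestamp
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows)
print(join_rows)
```

With this sample data both statements return the same three rows, including the record whose location has no match.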
JOINs are preferable to separate [sub]queries.
If the subselect (AKA subquery) is not correlated to the outer query, it's very likely the optimizer will scan the table(s) in the subselect once, because the value isn't likely to change. When you have correlation, as in the example provided, a single-pass optimization becomes very unlikely. In the past, it's been believed that correlated subqueries execute RBAR (Row By Agonizing Row). With a JOIN, the same result can be achieved while ensuring a single pass over the table.
This is a proper re-write of the query provided:
...because the subselect can return NULL if the user_id doesn't exist in the USERS table. Otherwise, you could use an INNER JOIN.
Derived tables/inline views are also possible using JOIN syntax.
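The rewritten queries this answer refers to are not shown here. As a sketch only, and not necessarily what the answer originally contained, a LEFT JOIN version of the question's first query might look like the following. The schema and rows are assumptions, and the queries run against an in-memory SQLite database via Python's sqlite3 module.

```python
import sqlite3

# Hypothetical data: the second record refers to user_id 7, which has no
# row in users.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT,
                        first_name TEXT, last_name TEXT);
    CREATE TABLE records (user_id INTEGER, in_timestamp TEXT, out_timestamp TEXT);
    INSERT INTO users VALUES (1, 'asmith', 'Ann', 'Smith');
    INSERT INTO records VALUES
        (1, '2024-01-01 08:00', '2024-01-01 16:00'),
        (7, '2024-01-01 09:00', '2024-01-01 17:00');
""")

# The question's first query, using correlated subqueries.
subquery_sql = """
    SELECT (SELECT username FROM users
             WHERE records.user_id = user_id) AS username,
           (SELECT last_name || ', ' || first_name FROM users
             WHERE records.user_id = user_id) AS name,
           in_timestamp,
           out_timestamp
      FROM records
     ORDER BY in_timestamp
"""

# Assumed rewrite: LEFT JOIN keeps records without a matching user.
left_join_sql = """
    SELECT u.username,
           u.last_name || ', ' || u.first_name AS name,
           r.in_timestamp,
           r.out_timestamp
      FROM records r
      LEFT JOIN users u ON u.user_id = r.user_id
     ORDER BY r.in_timestamp
"""

# INNER JOIN variant: records without a matching user are dropped.
inner_join_sql = left_join_sql.replace("LEFT JOIN", "INNER JOIN")

subquery_rows = conn.execute(subquery_sql).fetchall()
left_rows = conn.execute(left_join_sql).fetchall()
inner_rows = conn.execute(inner_join_sql).fetchall()
print(left_rows)
print(inner_rows)
```

On this data the LEFT JOIN matches the subquery version row for row, with NULL in the user columns for the unmatched record, while the INNER JOIN returns only the record that has a user.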
a) I'd start by pointing out that the two are not necessarily interchangeable. Nesting as you have done requires there to be 0 or 1 matching values; otherwise you will get an error. A join imposes no such requirement and may exclude the record or introduce more rows, depending on your data and the type of join.
b) In terms of performance, you will need to check the query plans, but your nested examples are unlikely to be more efficient than a table join. Typically sub-queries are executed once per row, but that very much depends on your database, unique constraints, foreign keys, NOT NULL constraints, etc. Maybe the DB can rewrite them more efficiently, but joins can use a wider variety of techniques, drive the data from different tables, etc., because they do different things (though you may not observe any difference in your output, depending on your data).
c) Most DB-aware programmers I know would look at your nested queries and rewrite them using joins, subject to the data being suitably 'clean'.
d) Regarding "correctness": I would favour joins backed up with proper constraints on your data where necessary (e.g. a unique user ID). You as a human may make certain assumptions, but the DB engine cannot unless you tell it. The more it knows, the better job it (and you) can do.
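Points a) and d) can be seen on a small, made-up data set in which users has no unique constraint and user_id 1 appears twice. This is a sketch using an in-memory SQLite database through Python's sqlite3 module. One caveat: SQLite does not raise an error when a scalar subquery matches several rows and simply uses the first one, whereas engines such as PostgreSQL reject the query, so only the row counts are compared here.

```python
import sqlite3

# Hypothetical data: no uniqueness on users.user_id, and user_id 1 is
# duplicated. The second record (user_id 2) has no matching user at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, username TEXT);
    CREATE TABLE records (user_id INTEGER, in_timestamp TEXT);
    INSERT INTO users VALUES (1, 'asmith'), (1, 'asmith_old');
    INSERT INTO records VALUES (1, '08:00'), (2, '09:00');
""")

# Scalar subquery: always exactly one output row per record.
subquery_rows = conn.execute("""
    SELECT (SELECT username FROM users
             WHERE records.user_id = user_id) AS username,
           in_timestamp
      FROM records
     ORDER BY in_timestamp
""").fetchall()

# INNER JOIN: the duplicated user doubles record 1, and record 2 vanishes.
inner_rows = conn.execute("""
    SELECT u.username, r.in_timestamp
      FROM records r
      JOIN users u ON u.user_id = r.user_id
     ORDER BY r.in_timestamp
""").fetchall()

# LEFT JOIN: record 1 is still doubled, but record 2 is kept with NULL.
left_rows = conn.execute("""
    SELECT u.username, r.in_timestamp
      FROM records r
      LEFT JOIN users u ON u.user_id = r.user_id
     ORDER BY r.in_timestamp
""").fetchall()

print(len(subquery_rows), len(inner_rows), len(left_rows))

# Point d): with a key constraint, the duplicate can never be stored.
strict = sqlite3.connect(":memory:")
strict.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)")
strict.execute("INSERT INTO users VALUES (1, 'asmith')")
try:
    strict.execute("INSERT INTO users VALUES (1, 'asmith_old')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Once the key constraint is in place, the join and the subquery can no longer disagree about how many users match a record.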
In most cases, joins will be much faster.
Let's take an example.
Let's use your first query:
Now consider that we have 100 records in records and 100 records in users (assuming we don't have an index on user_id).
So if we understand your algorithm, it says:
For each record
    Scan all 100 records in users to find the username
    Scan all 100 records in users to find the last name and first name
So it's as if we read 100*100*2 rows from the users table, i.e. 200 full scans. Is it really worth it? An index on user_id will make things better, but is it still worth it?
Now consider a join (a nested loop will produce almost the same result as above, but consider a hash join):
It's like:
Make a hash map of users.
For each record
    Find the matching record in the hash map, which will certainly be much faster than looping to find a record.
So clearly, joins should be favored.
NOTE: With only 100 records, the example may produce an identical plan, but the idea is to analyze how it can affect performance.
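The two strategies described above can be sketched in plain Python to make the arithmetic concrete. This is a toy model with made-up data and hypothetical function names, not how any particular database engine is implemented. It counts row visits for a single lookup column, so the question's query, which runs two subqueries per record, would double the scan figure.

```python
def scan_lookup(records, users):
    """Correlated-subquery style: scan every user row for every record."""
    comparisons = 0
    result = []
    for record in records:
        username = None
        for user in users:  # full scan, since there is no index
            comparisons += 1
            if user["user_id"] == record["user_id"]:
                username = user["username"]
        result.append((username, record["in_timestamp"]))
    return result, comparisons


def hash_join(records, users):
    """Hash-join style: build a hash map once, then probe it per record."""
    operations = 0
    by_id = {}
    for user in users:  # build phase: one pass over users
        operations += 1
        by_id[user["user_id"]] = user
    result = []
    for record in records:  # probe phase: one lookup per record
        operations += 1
        user = by_id.get(record["user_id"])
        result.append((user["username"] if user else None,
                       record["in_timestamp"]))
    return result, operations


# 100 users and 100 records, as in the example above.
users = [{"user_id": i, "username": f"user{i}"} for i in range(100)]
records = [{"user_id": i, "in_timestamp": i} for i in range(100)]

scan_result, scan_cost = scan_lookup(records, users)
join_result, join_cost = hash_join(records, users)
print(scan_cost, join_cost)  # 10000 200
```

Both functions return the same rows; only the amount of work differs, and the gap widens as the tables grow.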