为什么这个(不相关的)子查询会导致这样的问题?
我有一个大型查询,其中一个简单的子查询优化将其从 8 分钟缩短到 20 秒。我不确定我是否理解为什么优化会产生如此巨大的效果。
本质上,这是问题部分:
SELECT (bunch of stuff)
FROM
a LEFT OUTER JOIN b ON a.ID = b.a
LEFT OUTER JOIN c ON b.ID = c.b
...
...
INNER JOIN veryLargeTable
ON a.ID = veryLargeTable.a
AND veryLargeTable.PetID =
(SELECT id from Pets WHERE Pets.Name = 'Something') /* BAD! */
...
...
总共有 16 个连接表。如果我将 veryLargeTable
连接的第二个谓词替换为包含 petID 的预填充变量(而不是使用子查询),则整个查询速度将显着提高:
AND veryLargeTable.PetID = @petID /* Awesome! */
显然,每行都会执行 (SELECT id from Pets WHERE Name = 'Something')
。有两件事我不完全理解:
据我所知,这是一个不相关的子查询。 Pets 表根本不是外部查询的一部分。非相关子查询不是独立评估(并因此优化)的吗?为什么这里的情况不是这样?
执行计划截然不同。在失败的情况下(上面),整个子树处理估计 950k 行。在获胜的情况下(使用变量而不是子查询),估计只有大约 125k 行。这是怎么回事?如果存在该子查询,为什么会涉及这么多行? Pets.Name 列肯定具有唯一的数据(但据我所知,没有唯一的约束)。
请注意,正如我所期望的那样,将谓词移至 WHERE 子句不会影响任何一种情况下的查询,因为它是 INNER JOIN。
见解赞赏!
I've got a large query where a simple subquery optimization dropped it from 8 minutes down to 20 seconds. I'm not sure I understand why the optimization had such a drastic effect.
In essence, here's the problem part:
SELECT (bunch of stuff)
FROM
a LEFT OUTER JOIN b ON a.ID = b.a
LEFT OUTER JOIN c ON b.ID = c.b
...
...
INNER JOIN veryLargeTable
ON a.ID = veryLargeTable.a
AND veryLargeTable.PetID =
(SELECT id from Pets WHERE Pets.Name = 'Something') /* BAD! */
...
...
In all, there are 16 joined tables. If I replace the second predicate of the veryLargeTable
join with a pre-populated variable containing the petID (instead of using the subquery) the entire query speeds up dramatically:
AND veryLargeTable.PetID = @petID /* Awesome! */
Obviously, (SELECT id from Pets WHERE Name = 'Something')
is being executed for every row. There are two things I don't fully understand:
As far as I can tell, this is a non-correlated subquery. The Pets table is not part of the outer query at all. Aren't non-correlated subqueries independently evaluated (and hence optimized)? Why isn't this the case here?
The execution plans are dramatically different. In the failure case (above), the entire subtree deals with an estimated 950k rows. In the win case (using a variable instead of a subquery), there's only about 125k estimated rows. What's going on? Why are so many more rows involved if that subquery is there? The Pets.Name column definitely has unique data (but no unique constraint as far as I can tell).
Note that moving the predicate to the WHERE clause doesn't affect the query in either case, as I would expect, since it's an INNER JOIN.
Insights appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
根据我的经验,查询越复杂,SQL 优化器创建灵活计划的能力就越差。这里你有 16 个连接,一些或大部分是外部连接,你至少有一个子查询......扔进足够的索引,基数,视图,外部应用,谁知道还有什么,没有人,甚至微软都没有工程师*,可以找出统一、定期生成最佳计划的例程。
你所描述的,我已经经历过很多次了——在混乱的查询中改变一个简单的事情,一切都会快一个数量级(或者,咬牙切齿,慢一些)。我没有办法确定什么时候复杂就太复杂了,这更多的是一种感觉。我的一般经验法则是,如果它看起来太长或太复杂,请尽可能简化 - 例如您预先选择的单个嵌套值,或者打破查询的一部分总是运行快速使用小结果集,首先运行它并将结果存储在临时表中。
(*请注意,这是轻微的讽刺)
It has been my experience that, the more complex your queries get, the less able the SQL optimizer is to create deft plans. Here you've got 16 joins, some or most are outer joins, you've got at least one subquery... toss in enough indexes, cardinalities, views, outer applies, and who knows what else and no one, not even Microsoft engineers*, can figure out routines that will uniformly and regularly generate The most optimal plans.
What you've described, I've experienced numerous times -- change one simple thing in a messy query and everything's an order of magnitude faster (or, gnashes teeth, slower). I have no method for determining when complex is too complex, it's more a feeling than anything else. My general rule of thumb is, if it looks too long or too complex, simplify where you can--such as your pre-selected single nested value, or breaking out part of the query than will always run fast with a small result set, and running it first and storing the results in a temp table.
( * Please note that this is mild sarcsam)
作为替代方案,我认为您可以使用以下方法消除子查询:
As an alternative, I think you could eliminate the sub-query with:
我个人认为,如果 Pets.Name 上没有索引,结果并不奇怪。如果您在 Pets.Name 上创建唯一索引,您可能会看到更好的结果。从服务器的角度来看,如果没有索引,子查询可能会返回多行或 NULL。也许优化器可以做得更好;它经常需要帮助。
I personally think the outcome is not surprising if there's no index on Pets.Name. If you create a unique index on Pets.Name you are likely to see better results. Without index from the server's point of view the subquery may return multiple rows or NULL. Perhaps optimiser could do better; it often needs help.
原因是正如您所指出的,根据我的经验,即使是最简单的非相关子查询通常也会被 SQL Server 的查询优化器重新计算。
例如,您可以查看以下查询的执行计划,发现不相关的子查询被重新计算。
属性上有或没有聚集索引,在本例中为“ID”。正如有人指出的,您可以重写此查询以使用联接而不是子查询。然而,在许多情况下,如果子查询返回聚合标量,
那么连接重写可能不会那么容易进行。
The reason is as you pointed out, and from my experience, that often even the simplest non-correlated subqueries are often recomputed by SQL Server's query optimizer.
For example, you can look at the execution plan of the following query and see that the non-correlated subquery is recomputed.
This is with or without a clustered index on the attribute, "ID" in this case. As someone pointed out, you could rewrite this query to use a join instead of the subquery. In many cases that can be done, however, if the subquery returns instead an aggregate scalar e.g.
then a join rewrite might not work so easily.