Are these 2 SQL queries identical in every respect (e.g. estimated and actual execution plans)?

Posted on 2024-08-18 00:05:56

Are queries 1) and 2) equivalent in terms of both the estimated query plan and the actual plan? (Can statistics ever affect the actual plan here?)

declare @cat int -- input param from prc

...

select * 
from A as a
  join B as b
    on b.id = a.id
    on b.cat = @cat
  join C as c
    on c.fid = b.fid
    on c.cat = @cat
  where a.cat = @cat

select * 
from A as a
  join B as b
    on b.id = a.id
    on b.cat = a.cat
  join C as c
    on c.fid = b.fid
    on c.cat = b.cat
  where a.cat = @cat

It seems to me that these should be logically equivalent, and that the execution plan should always be the same regardless of the actual differences in the tables. Adding more conditions, either in the joins or in the WHERE clause, or adding more tables to the join, shouldn't change this.

Are there cases where this is not true?

夏了南城 2024-08-25 00:05:56

The optimizer could be affected by whether there is a usable index on any of a.cat, b.cat or c.cat - and whether that index includes the relevant id or fid column too. It might be smart enough to push the simple WHERE clause predicate down to the table level operations. It might also be affected by statistics for the tables. (If the value of the parameter happens to appear just once in C, it may be much more efficient to start processing C than to start processing A; or it may be more efficient to start with B; or it may still be more efficient to start with A. And it may or may not matter that the optimizer doesn't know the value of the parameter until the statement is executed, rather than being able to see it when it is prepared.)
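The effect of the parameter value can be observed directly. The following is a minimal sketch, assuming SQL Server (the `@cat` syntax suggests T-SQL, but the question does not say so) and a client such as SSMS or sqlcmd that understands the `GO` batch separator. The value 42 is made up for illustration.

```sql
-- Sketch only: assumes SQL Server / T-SQL; the value 42 is hypothetical.
SET STATISTICS XML ON;   -- return the actual execution plan after each query
GO

DECLARE @cat int = 42;

-- Compiled without knowing the value of @cat: the row estimate for
-- "cat = @cat" is typically based on average density, not on the
-- histogram entry for the specific value.
SELECT *
  FROM A AS a
  JOIN B AS b ON b.id  = a.id  AND b.cat = @cat
  JOIN C AS c ON c.fid = b.fid AND c.cat = @cat
 WHERE a.cat = @cat;

-- Same query, recompiled at execution time so the optimizer can use
-- the current value of @cat when estimating row counts.
SELECT *
  FROM A AS a
  JOIN B AS b ON b.id  = a.id  AND b.cat = @cat
  JOIN C AS c ON c.fid = b.fid AND c.cat = @cat
 WHERE a.cat = @cat
OPTION (RECOMPILE);
GO

SET STATISTICS XML OFF;
GO
```

If the two plans differ in join order or in estimated row counts, the parameter value is influencing the optimizer on that system. Other DBMSs expose plans through different commands.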

So, as already intimated in a comment, there is no definitive statement that can be made without knowing a lot more about the system you are using - including the schemas and so on.

The good news is that the result contents should be the same - there can be no guarantee about the execution plans.
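Result equivalence can be spot-checked on real data. This is a sketch under the same T-SQL assumption, with a made-up value of 42, and it selects only the columns the question names (`id`, `cat`, `fid`), since the rest of the schema is unknown.

```sql
-- Sketch only: T-SQL assumed; 42 is a hypothetical value.
DECLARE @cat int = 42;

-- Rows produced by query 1 that query 2 does not produce.
SELECT a.id, a.cat, b.fid
  FROM A AS a
  JOIN B AS b ON b.id  = a.id  AND b.cat = @cat
  JOIN C AS c ON c.fid = b.fid AND c.cat = @cat
 WHERE a.cat = @cat
EXCEPT
SELECT a.id, a.cat, b.fid
  FROM A AS a
  JOIN B AS b ON b.id  = a.id  AND b.cat = a.cat
  JOIN C AS c ON c.fid = b.fid AND c.cat = b.cat
 WHERE a.cat = @cat;
```

An empty result means every row from the first query also appears in the second; swapping the two halves checks the other direction. Note that EXCEPT compares distinct rows, so it will not detect a difference in how many times a duplicate row is returned.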

I note that most SQL systems would require a single keyword ON for each join:

select * 
from A as a
  join B as b
    on b.id = a.id
    AND b.cat = @cat
  join C as c
    on c.fid = b.fid
    AND c.cat = @cat
  where a.cat = @cat

Xerion asked:

Given that the execution plans may differ, which style of query is "better"? Here "better" means:
More likely to be optimized to give better plan (or less likely to confuse SQL optimizer).
More consistent with SQL convention.

It depends a bit on the DBMS and its optimizer; there is no substitute for empiricism in this area, but remember that what is empirically determined at one point in time may be erroneous at another point in time because:

The data sets have changed size
The data sets have different statistical distributions now
The optimizer may have changed
There may be different indexes available

There are many ways to do the joins, and it does (still) depend on the indexes available. I would probably write:

SELECT * 
  FROM A JOIN B ON b.id  = a.id  AND b.cat = a.cat
         JOIN C ON c.fid = b.fid AND c.cat = b.cat
 WHERE a.cat = @cat
   AND b.cat = @cat
   AND c.cat = @cat

The logic here is that there are likely to be indexes on A(id, cat) and B(id, cat), and on B(fid, cat) and C(fid, cat), and the optimizer can therefore fully exploit those indexes. The WHERE clause contains two redundant terms, but it lets the optimizer know what is required and tells it explicitly what it may not otherwise deduce for itself. If you are confident in the quality of the optimizer (and have checked the query plans it produces), then you might eliminate two of the three conditions in the WHERE clause.
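For reference, the indexes this reasoning assumes could be created as follows. The index names are hypothetical, and some of these indexes may already exist as primary-key or unique constraints in a real schema.

```sql
-- Sketch only: index names are invented for illustration.
-- Column order (join key first, then cat) follows the A(id, cat) notation above.
CREATE INDEX ix_A_id_cat  ON A (id, cat);
CREATE INDEX ix_B_id_cat  ON B (id, cat);
CREATE INDEX ix_B_fid_cat ON B (fid, cat);
CREATE INDEX ix_C_fid_cat ON C (fid, cat);
```

Whether the optimizer actually uses them still has to be confirmed from the query plan.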

If you place the parameter in the ON clauses, the optimizer might not make as good use of the indexes as it otherwise could. Again, you must experiment to find out how your optimizer behaves.

Finally, as already intimated, the set of indexes on the tables is crucial, and ensuring that any statistics the DBMS needs are up to date is often important. Also remember that if the tables are initially small, the optimizer may choose a different query plan from the one it chooses when the tables grow big, so do your testing at sensible levels. Unless your production tables will only contain a few tens of rows, don't do your performance testing for the query on tables that have only a few tens of rows.
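As a rough sketch of that housekeeping, again assuming T-SQL (other systems use different commands, such as ANALYZE), statistics can be refreshed and table sizes checked before a performance test:

```sql
-- Sketch only: T-SQL assumed.
-- Refresh optimizer statistics on the three tables from the question.
UPDATE STATISTICS A;
UPDATE STATISTICS B;
UPDATE STATISTICS C;

-- Confirm the test tables are of a realistic size before timing anything.
SELECT 'A' AS tbl, COUNT(*) AS row_count FROM A
UNION ALL
SELECT 'B', COUNT(*) FROM B
UNION ALL
SELECT 'C', COUNT(*) FROM C;
```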
