Can scalar functions be applied before filtering when executing a SQL statement?
I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is being called on data that should have been filtered out. It appears that the substring call is applied before the filtering happens, so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns, having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters are in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
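To reproduce this, the two tables can be set up roughly as follows. This is a minimal sketch: the column types and the primary keys are assumptions, since only the column names and the sample rows are given above.

```sql
-- Assumed types and keys; the question only specifies column names and rows.
CREATE TABLE NAMES (
    ID   INT         NOT NULL PRIMARY KEY,
    NAME VARCHAR(50) NOT NULL
);

CREATE TABLE LONG_NAMES (
    ID          INT NOT NULL PRIMARY KEY,
    NAME_LENGTH INT NOT NULL
);

-- Sample rows exactly as listed above.
INSERT INTO NAMES (ID, NAME) VALUES (1, 'Peter'), (2, 'X');
INSERT INTO LONG_NAMES (ID, NAME_LENGTH) VALUES (1, 5);
```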
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will pass a negative length (1 - 3 = -2) to the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
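For reference, the "case when" wrapper mentioned above might look roughly like this. It is a sketch rather than a guaranteed fix: the alias TRIMMED_NAME is made up for illustration, and it relies on SQL Server evaluating the WHEN condition of a scalar CASE expression before the THEN branch, which is the documented behavior for scalar expressions but not for aggregates.

```sql
-- Guard the SUBSTRING call inside the SELECT list itself, so the result
-- does not depend on when the join (or any WHERE predicate) is applied.
SELECT CASE
           WHEN LEN(NAMES.NAME) > 3
               THEN SUBSTRING(NAMES.NAME, 1, LEN(NAMES.NAME) - 3)
           ELSE NULL  -- too short to trim; never reaches SUBSTRING
       END AS TRIMMED_NAME  -- hypothetical alias
FROM NAMES
INNER JOIN LONG_NAMES
    ON NAMES.ID = LONG_NAMES.ID;
```

With the sample data, "Peter" is trimmed to "Pe", and the "X" row can no longer hand a negative length to SUBSTRING, whatever order the plan uses.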
Comments (3)
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it, I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
The linked question: "TSQL divide by zero encountered despite no columns containing 0"
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they performed these steps, without actually being required to perform each of them.
1. Evaluate the table constructors in the FROM clause.
2. Remove rows that do not satisfy the WHERE clause.
3. Evaluate the SELECT clause against the working table.
So, following this, no SQL DBMS should act as if it evaluates functions in the SELECT clause before it acts as if it applies the WHERE clause.
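Applied to the query from the question, that as-if order could be annotated like this. This is only an illustrative sketch: the WHERE predicate is the hypothetical one from the question, and the step numbers describe the logical model, not what any particular engine physically does.

```sql
-- Logical ("as if") evaluation order; the optimizer may pick a different
-- physical order as long as the visible result is supposed to match.
SELECT SUBSTRING(NAMES.NAME, 1, LEN(NAMES.NAME) - 3)  -- step 3: SELECT expressions
FROM NAMES
INNER JOIN LONG_NAMES
    ON NAMES.ID = LONG_NAMES.ID                       -- step 1: build the working table
WHERE LEN(NAMES.NAME) > 3;                            -- step 2: drop rows failing WHERE
```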
In a recent posting, Joe expands the steps to include CTEs.
C. J. Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It is based on query optimization rules, indexes, temporary buffers, and execution-time statistics. If you are using SQL Server Management Studio, the toolbar above the query editor lets you display the estimated execution plan, which shows how your query will be rearranged to gain speed. So if your NAMES table was just used and is still in the buffer, the engine might first process that table's rows on their own and only then join the result with the other table.
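The estimated plan can also be requested as text from a query window instead of through the toolbar. A minimal sketch, assuming the tables from the question exist and that the script is run from a client such as SSMS or sqlcmd that understands the GO batch separator:

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO

-- While SHOWPLAN_TEXT is ON, this query is not executed;
-- SQL Server returns the estimated plan as rows of text instead.
SELECT SUBSTRING(NAMES.NAME, 1, LEN(NAMES.NAME) - 3)
FROM NAMES
INNER JOIN LONG_NAMES
    ON NAMES.ID = LONG_NAMES.ID;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

Reading the operators in the plan shows where the scalar expression is computed relative to the join, which is the thing to check when the query fails on rows the join should have removed.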