为什么执行计划包含对持久计算列的用户定义函数调用?

发布于 2024-11-07 15:24:32 字数 536 浏览 0 评论 0原文

我有一个包含 2 个计算列的表,其中两个列的“Is Persisted”都设置为 true。但是,当在查询中使用它们时,执行计划会显示用于计算列的 UDF 作为计划的一部分。由于列数据是在添加/更新行时由 UDF 计算的,为什么计划会包含它?

当这些列包含在查询中时,查询速度非常慢(> 30 秒),而当它们被排除时,查询速度快如闪电(< 1 秒)。这使我得出结论,查询实际上是在运行时计算列值,但事实不应该如此,因为它们被设置为持久化。

我在这里错过了什么吗?

更新:这里有更多关于我们使用计算列的推理的信息。

我们是一家体育公司,有一个客户将完整的球员姓名存储在单列中。他们要求我们允许他们分别按名字和/或姓氏搜索玩家数据。值得庆幸的是,他们对玩家姓名使用一致的格式 - 姓氏、名字(昵称) - 因此解析它们相对容易。我创建了一个 UDF,它调用 CLR 函数以使用正则表达式解析名称部分。显然,调用 UDF(进而调用 CLR 函数)的成本非常高。但由于它仅在持久化列上使用,我认为它只会在一天中将数据导入数据库的几次中使用。

I have a table with 2 computed columns, both of which has "Is Persisted" set to true. However, when using them in a query the Execution Plan shows the UDF used to compute the columns as part of the plan. Since the column data is calculated by the UDF when the row is added/updated why would the plan include it?

The query is incredibly slow (>30s) when these columns are included in the query, and lightning fast (<1s) when they are excluded. This leads me to conclude that the query is actually calculating the column values at run time, which shouldn't be the case since they are set to persisted.

Am I missing something here?

UPDATE: Here's a little more info regarding our reasoning for using the computed column.

We are a sports company and have a customer who stores full player names in a single column. They require us to allow them to search player data by first and/or last name separately. Thankfully they use a consistent format for player names - LastName, FirstName (NickName) - so parsing them is relatively easy. I created a UDF that calls into a CLR function to parse the name parts using a regular expression. So obviously calling the UDF, which in turn calls a CLR function, is very expensive. But since it is only used on a persisted column I figured it would only be used during the few times a day that we import data into the database.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

骄兵必败 2024-11-14 15:24:32

原因是查询优化器在计算用户定义函数的成本方面做得不太好。在某些情况下,它决定完全重新评估每一行的函数会更便宜,而不是引发可能需要的磁盘读取。

SQL Server的成本计算模型不会检查函数的结构来了解它到底有多昂贵,因此优化器没有这方面的准确信息。您的函数可以任意复杂,因此成本计算受到这种方式的限制也许是可以理解的。对于标量函数和多语句表值函数来说,效果最差,因为每行调用这些函数的成本非常昂贵。

您可以通过检查查询计划来判断优化器是否决定重新评估函数(而不是使用持久值)。如果存在一个计算标量迭代器,其显式引用其定义值列表中的函数名称,则该函数将每行调用一次。如果“定义值”列表引用列名称,则不会调用该函数。

我的建议通常是根本不要在计算列定义中使用函数。

下面的复制脚本演示了该问题。请注意,为表定义的主键是非聚集的,因此获取持久值将需要从索引进行书签查找或表扫描。优化器认为从索引读取函数的源列并重新计算每行的函数会更便宜,而不是产生书签查找或表扫描的成本。

在这种情况下,对持久列建立索引可以加快查询速度。一般来说,优化器倾向于采用避免重新计算函数的访问路径,但决策是基于成本的,因此即使在索引时,仍然有可能看到为每行重新计算的函数。尽管如此,为优化器提供“明显”且有效的访问路径确实有助于避免这种情况。

请注意,该列不必必须保留才能建立索引。这是一个非常常见的误解;只有在不精确的情况下才需要保留列(它使用浮点算术或值)。在当前情况下保留该列不会增加任何价值,并且会增加基表的存储需求。

保罗·怀特

-- An expensive scalar function
CREATE FUNCTION dbo.fn_Expensive(@n INTEGER)
RETURNS BIGINT 
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @sum_n BIGINT;
    SET @sum_n = 0;

    WHILE @n > 0
    BEGIN
        SET @sum_n = @sum_n + @n;
        SET @n = @n - 1
    END;

    RETURN @sum_n;
END;
GO
-- A table that references the expensive
-- function in a PERSISTED computed column
CREATE TABLE dbo.Demo
(
    n       INTEGER PRIMARY KEY NONCLUSTERED,
    sum_n   AS dbo.fn_Expensive(n) PERSISTED
);
GO
-- Add 8000 rows to the table
-- with n from 1 to 8000 inclusive
WITH Numbers AS
(
    SELECT TOP (8000)
        n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
    FROM master.sys.columns AS C1
    CROSS JOIN master.sys.columns AS C2
    CROSS JOIN master.sys.columns AS C3
)
INSERT dbo.Demo (N.n)
SELECT
    N.n
FROM Numbers AS N
WHERE
    N.n >= 1
    AND N.n <= 5000
GO
-- This is slow
-- Plan includes a Compute Scalar with:
-- [dbo].[Demo].sum_n = Scalar Operator([[dbo].[fn_Expensive]([dbo].[Demo].[n]))
-- QO estimates calling the function is cheaper than the bookmark lookup
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the computed column
-- Notice the actual plan also calls the function for every row, and includes:
-- [dbo].[Demo].sum_n = Scalar Operator([[dbo].[fn_Expensive]([dbo].[Demo].[n]))
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Query now uses the index, and is fast
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Drop the index
DROP INDEX uq1 ON dbo.Demo;
GO
-- Don't persist the column
ALTER TABLE dbo.Demo
ALTER COLUMN sum_n DROP PERSISTED;
GO
-- Show again, as you would expect
-- QO has no option but to call the function for each row
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the non-persisted column
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Fast again
-- Persisting the column bought us nothing
-- and used extra space in the table
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Clean up
DROP TABLE dbo.Demo;
DROP FUNCTION dbo.fn_Expensive;
GO

The reason is that the query optimizer does not do a very good job at costing user-defined functions. It decides, in some cases, that it would be cheaper to completely re-evaluate the function for each row, rather than incur the disk reads that might be necessary otherwise.

SQL Server's costing model does not inspect the structure of the function to see how expensive it really is, so the optimizer does not have accurate information in this regard. Your function could be arbitrarily complex, so it is perhaps understandable that costing is limited this way. The effect is worst for scalar and multi-statement table-valued functions, since these are extremely expensive to call per-row.

You can tell whether the optimizer has decided to re-evaluate the function (rather than using the persisted value) by inspecting the query plan. If there is a Compute Scalar iterator with an explicit reference to the function name in its Defined Values list, the function will be called once per row. If the Defined Values list references the column name instead, the function will not be called.

My advice is generally not to use functions in computed column definitions at all.

The reproduction script below demonstrates the issue. Notice that the PRIMARY KEY defined for the table is nonclustered, so fetching the persisted value would require a bookmark lookup from the index, or a table scan. The optimizer decides it is cheaper to read the source column for the function from the index and re-compute the function per row, rather than incur the cost of a bookmark lookup or table scan.

Indexing the persisted column speeds the query up in this case. In general, the optimizer tends to favour an access path that avoids re-computing the function, but the decision is cost-based so it is still possible to see a function re-computed for each row even when indexed. Nevertheless, providing an 'obvious' and efficient access path to the optimizer does help to avoid this.

Notice that the column does not have to be persisted in order to be indexed. This is a very common misconception; persisting the column is only required where it is imprecise (it uses floating-point arithmetic or values). Persisting the column in the present case adds no value and expands the base table's storage requirement.

Paul White

-- An expensive scalar function
CREATE FUNCTION dbo.fn_Expensive(@n INTEGER)
RETURNS BIGINT 
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @sum_n BIGINT;
    SET @sum_n = 0;

    WHILE @n > 0
    BEGIN
        SET @sum_n = @sum_n + @n;
        SET @n = @n - 1
    END;

    RETURN @sum_n;
END;
GO
-- A table that references the expensive
-- function in a PERSISTED computed column
CREATE TABLE dbo.Demo
(
    n       INTEGER PRIMARY KEY NONCLUSTERED,
    sum_n   AS dbo.fn_Expensive(n) PERSISTED
);
GO
-- Add 8000 rows to the table
-- with n from 1 to 8000 inclusive
WITH Numbers AS
(
    SELECT TOP (8000)
        n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
    FROM master.sys.columns AS C1
    CROSS JOIN master.sys.columns AS C2
    CROSS JOIN master.sys.columns AS C3
)
INSERT dbo.Demo (N.n)
SELECT
    N.n
FROM Numbers AS N
WHERE
    N.n >= 1
    AND N.n <= 5000
GO
-- This is slow
-- Plan includes a Compute Scalar with:
-- [dbo].[Demo].sum_n = Scalar Operator([[dbo].[fn_Expensive]([dbo].[Demo].[n]))
-- QO estimates calling the function is cheaper than the bookmark lookup
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the computed column
-- Notice the actual plan also calls the function for every row, and includes:
-- [dbo].[Demo].sum_n = Scalar Operator([[dbo].[fn_Expensive]([dbo].[Demo].[n]))
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Query now uses the index, and is fast
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Drop the index
DROP INDEX uq1 ON dbo.Demo;
GO
-- Don't persist the column
ALTER TABLE dbo.Demo
ALTER COLUMN sum_n DROP PERSISTED;
GO
-- Show again, as you would expect
-- QO has no option but to call the function for each row
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the non-persisted column
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Fast again
-- Persisting the column bought us nothing
-- and used extra space in the table
SELECT
    MAX(sum_n)
FROM dbo.Demo;
GO
-- Clean up
DROP TABLE dbo.Demo;
DROP FUNCTION dbo.fn_Expensive;
GO
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文