摆脱全索引扫描
以下查询执行得很差,因为对 P4FileReleases 中的 650 万条记录进行了完整的非聚集索引扫描,然后进行了哈希联接。我正在寻找优化器选择扫描而不是搜索的可能原因。
SELECT p4f.FileReleaseID
FROM P4FileReleases p4f
INNER JOIN AnalyzedFileView af
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
WHERE (af.tracked_change_id = 1)
据我所知,我认为优化器没有理由选择 P4FileReleases 扫描。 WHERE 子句将正确数据集的大小限制为大约 1K 条记录,优化器应该知道这一点(请参见下面的直方图)。
事实上,如果我获取视图数据并将其放入堆表(与索引视图相同的结构)中,则查询是通过在较大表上进行索引查找和内部联接循环而不是哈希联接来执行的(并且总成本从 145 下降到 1) 左右。
关于什么可能会导致优化器失效有什么想法吗?
详细信息。 Sql Server 2008(v. 10.0.2757.0)。
P4FileReleases 表 保存 650 万条记录
CREATE TABLE [dbo].[P4FileReleases](
[FileReleaseID] [int] IDENTITY(1,1) NOT NULL,
[FileRelease] [varchar](254) NOT NULL,
-- 5 more fields
CONSTRAINT [CIX_P4FileReleases_FileReleaseID_PK] PRIMARY KEY CLUSTERED
(
[FileReleaseID] ASC
),
CONSTRAINT [NCIX_P4FileReleases_FileRelease] UNIQUE NONCLUSTERED
(
[FileRelease] ASC
)
AnalyzedFileView 是启用统计且最新的索引视图。
它有四列:
key int (int, PK) - clustered index
tracked_change_id (int, FK) - non-unique, non-clustered index (covering 'path', 'revision')
path (nvarchar(1024), null)
revision (smallint, null)
tracked_change_id 直方图:
1 0 1222 0 1
4 0 787 0 1
8 0 2754 0 1
12 0 254 0 1
13 0 34 0 1
查询计划
|--Parallelism(Gather Streams)
|--Hash Match(Inner Join, HASH:([Expr1011])=([Expr1010]), RESIDUAL:([Expr1010]=[Expr1011]))
|--Bitmap(HASH:([Expr1011]), DEFINE:([Bitmap1015]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
| |--Compute Scalar(DEFINE:([Expr1011]=([qpsitools].[dbo].[analyzed_file_view].[path]+N'#')+CONVERT_IMPLICIT(nvarchar(30),CONVERT(varchar(30),[qpsitools].[dbo].[analyzed_file_view].[revision],0),0)))
| |--Index Seek(OBJECT:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]), SEEK:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]=(1)) ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1010]), WHERE:(PROBE([Bitmap1015],[Expr1010])))
|--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(nvarchar(254),[Blueprint].[dbo].[P4FileReleases].[FileRelease] as [p4f].[FileRelease],0)))
|--Index Scan(OBJECT:([Blueprint].[dbo].[P4FileReleases].[NCIX_P4FileReleases_FileRelease] AS [p4f]))
The following query performs badly because of a full non-clustered index scan of 6.5 million records in P4FileReleases followed by a hash join. I'm looking for possible reasons the optimizer picks a scan over a seek.
SELECT p4f.FileReleaseID
FROM P4FileReleases p4f
INNER JOIN AnalyzedFileView af
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
WHERE (af.tracked_change_id = 1)
From what I can tell, I see no reason for the optimizer to pick a scan of P4FileReleases. The WHERE clause limits the size of the right dataset to about 1K of records and the optimizer should know it (see the histogram below).
If fact, if I take the view data and throw it into a heap table (same structure as the indexed view), then the query is performed with an index seek on the larger table and an inner join loop instead of a hash join (and the total cost drops from 145 to around 1).
Any ideas on what might be throwing the optimizer off?
Details. Sql Server 2008 (v. 10.0.2757.0).
P4FileReleases table
Holds 6.5 million records
CREATE TABLE [dbo].[P4FileReleases](
[FileReleaseID] [int] IDENTITY(1,1) NOT NULL,
[FileRelease] [varchar](254) NOT NULL,
-- 5 more fields
CONSTRAINT [CIX_P4FileReleases_FileReleaseID_PK] PRIMARY KEY CLUSTERED
(
[FileReleaseID] ASC
),
CONSTRAINT [NCIX_P4FileReleases_FileRelease] UNIQUE NONCLUSTERED
(
[FileRelease] ASC
)
AnalyzedFileView
is an indexed view with statistics enabled and up-to-date.
It has four columns:
key int (int, PK) - clustered index
tracked_change_id (int, FK) - non-unique, non-clustered index (covering 'path', 'revision')
path (nvarchar(1024), null)
revision (smallint, null)
tracked_change_id histogram:
1 0 1222 0 1
4 0 787 0 1
8 0 2754 0 1
12 0 254 0 1
13 0 34 0 1
Query Plan
|--Parallelism(Gather Streams)
|--Hash Match(Inner Join, HASH:([Expr1011])=([Expr1010]), RESIDUAL:([Expr1010]=[Expr1011]))
|--Bitmap(HASH:([Expr1011]), DEFINE:([Bitmap1015]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
| |--Compute Scalar(DEFINE:([Expr1011]=([qpsitools].[dbo].[analyzed_file_view].[path]+N'#')+CONVERT_IMPLICIT(nvarchar(30),CONVERT(varchar(30),[qpsitools].[dbo].[analyzed_file_view].[revision],0),0)))
| |--Index Seek(OBJECT:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]), SEEK:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]=(1)) ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1010]), WHERE:(PROBE([Bitmap1015],[Expr1010])))
|--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(nvarchar(254),[Blueprint].[dbo].[P4FileReleases].[FileRelease] as [p4f].[FileRelease],0)))
|--Index Scan(OBJECT:([Blueprint].[dbo].[P4FileReleases].[NCIX_P4FileReleases_FileRelease] AS [p4f]))
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
您正在将 varchar 列 p4f.FileRelease 与 nvarchar 列 (af.path) 连接起来。由于数据类型不匹配,SQL 必须将一个类型转换为另一个类型(当然它不能从 nvarchar 转换为 varchar)。在将 af.path 转换为 nvarchar 时,它失去了使用索引查找/过滤这些值的能力,导致需要扫描和转换所有可能的行。
最佳解决方案是将数据存储为匹配的数据类型(将列 p4f.FileRelase 更改为 nvarchar,或将 af.path 更改为 varchar)。由于没有人可以修改现有的数据库结构,因此一种解决方法可能是在查询中将 af.path 显式转换为 varchar。测试一下并看看...当然,如果数据确实需要双字节格式,您就不能这样做。
You are joining varchar column p4f.FileRelease with an nvarchar column (af.path). Since the data types don't match, SQL has to convert one's type to the other's (and of course it can't go from nvarchar to varchar). In converting af.path to nvarchar, it loses the ability to use the index to lookup/filter those values, resulting in the need to scan and convert all possible rows.
The best solution is to store the data as matching data types (change column p4f.FileRelase to nvarchar, or af.path to varchar). Since no one ever gets to modify existing database structures, a work-around might be to explicitly cast af.path to varchar in the query. Test it and see... though of course you can't do this if the data truly requires double-byte formatting.
你的问题不是 WHERE 而是 JOIN,你正在获得隐式转换并在 JOIN 上进行扫描,在 WHERE 条件下你获得 SEEK
并行性也可能是一个问题,请尝试添加 MAXDOP=1
您的统计信息是否是最新的?是否存在过度碎片化?
your problem is not the WHERE but the JOIN, you are getting an implicit conversion and a scan on the JOIN, on the WHERE condition you are getting a SEEK
Parallelism could also be a problem, try adding MAXDOP=1
Are your statistics up to date? Is there excessive fragmentation?
尝试将“af.tracked_change_id = 1”移至连接子句中。
WHERE 在 INNER JOIN 之后应用
Try moving "af.tracked_change_id = 1" into the join clause.
WHERE is applied after the INNER JOIN
菲利普·凯利发现了问题所在。这是 P4FileReleases 中的 varchar 与AnalysedFileView 中的 nvarchar 之间的数据类型不匹配。
Philip Kelley spotted the problem. It was a datatype mismatch between varchar in P4FileReleases and nvarchar in AnalyzedFileView.