摆脱全索引扫描

发布于 2024-09-08 20:46:58 字数 2695 浏览 3 评论 0原文

以下查询执行得很差,因为对 P4FileReleases 中的 650 万条记录进行了完整的非聚集索引扫描,然后进行了哈希联接。我正在寻找优化器选择扫描而不是搜索的可能原因。

SELECT p4f.FileReleaseID 
   FROM P4FileReleases p4f
   INNER JOIN AnalyzedFileView af 
      ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))  
   WHERE (af.tracked_change_id = 1)

据我所知,我认为优化器没有理由选择 P4FileReleases 扫描。 WHERE 子句将正确数据集的大小限制为大约 1K 条记录,优化器应该知道这一点(请参见下面的直方图)。

事实上,如果我获取视图数据并将其放入堆表(与索引视图相同的结构)中,则查询是通过在较大表上进行索引查找和内部联接循环而不是哈希联接来执行的(并且总成本从 145 下降到 1) 左右。

关于什么可能会导致优化器失效有什么想法吗?

详细信息。 Sql Server 2008(v. 10.0.2757.0)。

P4FileReleases 表 保存 650 万条记录

CREATE TABLE [dbo].[P4FileReleases](
    [FileReleaseID] [int] IDENTITY(1,1) NOT NULL,
    [FileRelease] [varchar](254) NOT NULL,
    -- 5 more fields 
 CONSTRAINT [CIX_P4FileReleases_FileReleaseID_PK] PRIMARY KEY CLUSTERED 
(
    [FileReleaseID] ASC
),
CONSTRAINT [NCIX_P4FileReleases_FileRelease] UNIQUE NONCLUSTERED 
(
    [FileRelease] ASC
)

AnalyzedFileView 是启用统计且最新的索引视图。

它有四列:

   key int (int, PK) - clustered index
   tracked_change_id (int, FK) - non-unique, non-clustered index (covering 'path', 'revision')
   path (nvarchar(1024), null) 
   revision (smallint, null)

tracked_change_id 直方图:

1   0   1222    0   1
4   0   787     0   1
8   0   2754    0   1
12  0   254     0   1
13  0   34      0   1

查询计划

  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([Expr1011])=([Expr1010]), RESIDUAL:([Expr1010]=[Expr1011]))
            |--Bitmap(HASH:([Expr1011]), DEFINE:([Bitmap1015]))
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
            |         |--Compute Scalar(DEFINE:([Expr1011]=([qpsitools].[dbo].[analyzed_file_view].[path]+N'#')+CONVERT_IMPLICIT(nvarchar(30),CONVERT(varchar(30),[qpsitools].[dbo].[analyzed_file_view].[revision],0),0)))
            |              |--Index Seek(OBJECT:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]), SEEK:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]=(1)) ORDERED FORWARD)
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1010]), WHERE:(PROBE([Bitmap1015],[Expr1010])))
                 |--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(nvarchar(254),[Blueprint].[dbo].[P4FileReleases].[FileRelease] as [p4f].[FileRelease],0)))
                      |--Index Scan(OBJECT:([Blueprint].[dbo].[P4FileReleases].[NCIX_P4FileReleases_FileRelease] AS [p4f]))

The following query performs badly because of a full non-clustered index scan of 6.5 million records in P4FileReleases followed by a hash join. I'm looking for possible reasons the optimizer picks a scan over a seek.

SELECT p4f.FileReleaseID 
   FROM P4FileReleases p4f
   INNER JOIN AnalyzedFileView af 
      ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))  
   WHERE (af.tracked_change_id = 1)

From what I can tell, I see no reason for the optimizer to pick a scan of P4FileReleases. The WHERE clause limits the size of the right dataset to about 1K of records and the optimizer should know it (see the histogram below).

If fact, if I take the view data and throw it into a heap table (same structure as the indexed view), then the query is performed with an index seek on the larger table and an inner join loop instead of a hash join (and the total cost drops from 145 to around 1).

Any ideas on what might be throwing the optimizer off?

Details. Sql Server 2008 (v. 10.0.2757.0).

P4FileReleases table
Holds 6.5 million records

CREATE TABLE [dbo].[P4FileReleases](
    [FileReleaseID] [int] IDENTITY(1,1) NOT NULL,
    [FileRelease] [varchar](254) NOT NULL,
    -- 5 more fields 
 CONSTRAINT [CIX_P4FileReleases_FileReleaseID_PK] PRIMARY KEY CLUSTERED 
(
    [FileReleaseID] ASC
),
CONSTRAINT [NCIX_P4FileReleases_FileRelease] UNIQUE NONCLUSTERED 
(
    [FileRelease] ASC
)

AnalyzedFileView
is an indexed view with statistics enabled and up-to-date.

It has four columns:

   key int (int, PK) - clustered index
   tracked_change_id (int, FK) - non-unique, non-clustered index (covering 'path', 'revision')
   path (nvarchar(1024), null) 
   revision (smallint, null)

tracked_change_id histogram:

1   0   1222    0   1
4   0   787     0   1
8   0   2754    0   1
12  0   254     0   1
13  0   34      0   1

Query Plan

  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([Expr1011])=([Expr1010]), RESIDUAL:([Expr1010]=[Expr1011]))
            |--Bitmap(HASH:([Expr1011]), DEFINE:([Bitmap1015]))
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
            |         |--Compute Scalar(DEFINE:([Expr1011]=([qpsitools].[dbo].[analyzed_file_view].[path]+N'#')+CONVERT_IMPLICIT(nvarchar(30),CONVERT(varchar(30),[qpsitools].[dbo].[analyzed_file_view].[revision],0),0)))
            |              |--Index Seek(OBJECT:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]), SEEK:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]=(1)) ORDERED FORWARD)
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1010]), WHERE:(PROBE([Bitmap1015],[Expr1010])))
                 |--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(nvarchar(254),[Blueprint].[dbo].[P4FileReleases].[FileRelease] as [p4f].[FileRelease],0)))
                      |--Index Scan(OBJECT:([Blueprint].[dbo].[P4FileReleases].[NCIX_P4FileReleases_FileRelease] AS [p4f]))

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

翻身的咸鱼 2024-09-15 20:46:58

您正在将 varchar 列 p4f.FileRelease 与 nvarchar 列 (af.path) 连接起来。由于数据类型不匹配,SQL 必须将一个类型转换为另一个类型(当然它不能从 nvarchar 转换为 varchar)。在将 af.path 转换为 nvarchar 时,它失去了使用索引查找/过滤这些值的能力,导致需要扫描和转换所有可能的行。

最佳解决方案是将数据存储为匹配的数据类型(将列 p4f.FileRelase 更改为 nvarchar,或将 af.path 更改为 varchar)。由于没有人可以修改现有的数据库结构,因此一种解决方法可能是在查询中将 af.path 显式转换为 varchar。测试一下并看看...当然,如果数据确实需要双字节格式,您就不能这样做。

You are joining varchar column p4f.FileRelease with an nvarchar column (af.path). Since the data types don't match, SQL has to convert one's type to the other's (and of course it can't go from nvarchar to varchar). In converting af.path to nvarchar, it loses the ability to use the index to lookup/filter those values, resulting in the need to scan and convert all possible rows.

The best solution is to store the data as matching data types (change column p4f.FileRelase to nvarchar, or af.path to varchar). Since no one ever gets to modify existing database structures, a work-around might be to explicitly cast af.path to varchar in the query. Test it and see... though of course you can't do this if the data truly requires double-byte formatting.

浅语花开 2024-09-15 20:46:58

你的问题不是 WHERE 而是 JOIN,你正在获得隐式转换并在 JOIN 上进行扫描,在 WHERE 条件下你获得 SEEK

ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))  

并行性也可能是一个问题,请尝试添加 MAXDOP=1

您的统计信息是否是最新的?是否存在过度碎片化?

your problem is not the WHERE but the JOIN, you are getting an implicit conversion and a scan on the JOIN, on the WHERE condition you are getting a SEEK

ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))  

Parallelism could also be a problem, try adding MAXDOP=1

Are your statistics up to date? Is there excessive fragmentation?

尹雨沫 2024-09-15 20:46:58

尝试将“af.tracked_change_id = 1”移至连接子句中。

INNER JOIN AnalyzedFileView af 
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
AND af.tracked_change_id = 1

WHERE 在 INNER JOIN 之后应用

Try moving "af.tracked_change_id = 1" into the join clause.

INNER JOIN AnalyzedFileView af 
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
AND af.tracked_change_id = 1

WHERE is applied after the INNER JOIN

垂暮老矣 2024-09-15 20:46:58

菲利普·凯利发现了问题所在。这是 P4FileReleases 中的 varchar 与AnalysedFileView 中的 nvarchar 之间的数据类型不匹配。

Philip Kelley spotted the problem. It was a datatype mismatch between varchar in P4FileReleases and nvarchar in AnalyzedFileView.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文