SQL "WITH" performance and temp tables (possibly simplified with a "query hint"?)
Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results: a limited, ranked list of values based on one field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc...) the bottom query executes MUCH faster than the top one. Sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that takes a cleaner approach (keeping the format as close to query 1 as possible)?
Note that the "Ranking" and the secondary queries are just fluff for this example; the actual operations performed don't matter much.
This is roughly what I was hoping for (or something similar, but I hope the idea is clear). Remember, the query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow-up information!
If in your subquery you add
TOP 999999999 -- improves speed dramatically
your query will behave in a similar fashion to the temp-table version of the previous query. I found the execution times improved in almost exactly the same way, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However,
TOP 100 PERCENT -- does NOT improve speed
does NOT perform in the same fashion (you must use the static-number style TOP 999999999).
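For reference, this is the first query with the trick applied (a sketch only; nothing is new beyond the static TOP in the first CTE):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT TOP 999999999 * -- static TOP: makes the optimizer evaluate this CTE separately (see explanation below)
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered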
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original with normal CTEs, and the one with TOP 999999999 in each subquery):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the subqueries from the main query in order to apply the TOP statement, forcing creation of an in-memory "Bitmap" of the subquery that is then joined to the main query. This appears to do exactly what I wanted, and it may even be more efficient, since a server with a large amount of RAM can execute the query entirely in memory without any disk IO. In my case we have 280 GB of RAM, far more than could ever really be used.
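If you want to reproduce the comparison, the standard session statistics options (generic T-SQL, nothing specific to this schema) make the difference easy to measure alongside the actual execution plans:
SET STATISTICS IO ON;   -- logical/physical reads per table
SET STATISTICS TIME ON; -- CPU and elapsed time per statement
-- run the plain-CTE version and the TOP 999999999 version back to back,
-- then compare the Messages output and the two actual execution plans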
3 Answers
Not only can you use indexes on temp tables, but they also allow the use of statistics and of hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go when you have a large data set and the choice is between temp tables and table variables, even when you don't use indexes (possibly because statistics are used to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
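As a rough illustration of the difference (the index name is made up, and the index hint is just one example of a hint a CTE cannot take), all of the following are possible against the #LargeData temp table from the question but not against a CTE:
CREATE INDEX IX_LargeData_valuefield ON #LargeData (valuefield DESC); -- index it
UPDATE STATISTICS #LargeData;                                         -- refresh its statistics
SELECT TOP 10 *
FROM #LargeData WITH (INDEX(IX_LargeData_valuefield))                 -- table hint
ORDER BY valuefield DESC;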
I think the best thing to do, though, is to see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a query plan. In the second query a good query plan can't be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table as you already do, and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
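A minimal sketch of that suggestion (the index name is hypothetical), placed right after the SELECT ... INTO #LargeData step of the second query:
-- after: SELECT * INTO #LargeData FROM mydata WHERE dt=@DT;
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
ON #LargeData (valuefield); -- non-clustered index on the ranking column, as suggested above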
It is quite possible that SQL is optimizing for the wrong value of the parameter.
There are a couple of options:
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The problem with this is if you pick the wrong value.
See "I Smell a Parameter!" from The SQL Server Query Optimization Team blog.
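For illustration, here is how either hint would be attached to the simplified query from the question (a sketch only; the question's own date key is used as the "representative" value):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT *
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
OPTION (RECOMPILE); -- option 1: recompile each run so the plan fits the current @DT value
-- option 2, instead of RECOMPILE: build the plan as if @DT always held a representative value
-- OPTION (OPTIMIZE FOR (@DT = 20110717));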