避免 SQL Server 上急切的假脱机操作的方法
我有一个 ETL 过程,其中涉及一个存储过程,该过程大量使用 SELECT INTO 语句(最少记录,因此速度更快,因为它们生成的日志流量较少)。 在一个特定存储过程中发生的一批工作中,有几个最昂贵的操作是急切的假脱机,它们似乎只是缓冲查询结果,然后将它们复制到刚刚创建的表中。
关于 eager spool 的 MSDN 文档非常稀疏。 有谁更深入地了解这些是否真的有必要(以及在什么情况下)? 我有一些可能有意义也可能没有意义的理论,但没有成功地从查询中消除这些理论。
.sqlplan 文件非常大(160kb),因此我想将它们直接发布到论坛可能不合理。
因此,这里有一些可能适合特定答案的理论:
- 查询使用一些 UDF 进行数据转换,例如解析格式化日期。 此数据转换是否需要在构造表之前使用 eager spool 为表分配合理的类型(例如 varchar 长度)?
- 作为上述问题的扩展,是否有人对查询中驱动此操作或不驱动此操作的因素有更深入的了解?
I have an ETL process that involves a stored procedure that makes heavy use of SELECT INTO
statements (minimally logged and therefore faster as they generate less log traffic). Of the batch of work that takes place in one particular stored the stored procedure several of the most expensive operations are eager spools that appear to just buffer the query results and then copy them into the table just being made.
The MSDN documentation on eager spools is quite sparse. Does anyone have a deeper insight into whether these are really necessary (and under what circumstances)? I have a few theories that may or may not make sense, but no success in eliminating these from the queries.
The .sqlplan files are quite large (160kb) so I guess it's probably not reasonable to post them directly to a forum.
So, here are some theories that may be amenable to specific answers:
- The query uses some UDFs for data transformation, such as parsing formatted dates. Does this data transformation necessitate the use of eager spools to allocate sensible types (e.g. varchar lengths) to the table before it constructs it?
- As an extension of the question above, does anyone have a deeper view of what does or does not drive this operation in a query?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我对假脱机的理解是,它对你的执行计划来说有点转移注意力。 是的,它占了查询成本的很大一部分,但它实际上是 SQL Server 自动进行的优化,以便避免昂贵的重新扫描。 如果要避免假脱机,它所在的执行树的成本将会上升,并且几乎可以肯定整个查询的成本也会增加。 我对什么可能导致数据库的查询优化器以这种方式解析执行没有任何特别的见解,特别是在没有看到 SQL 代码的情况下,但您最好相信它的行为。
但是,这并不意味着您的执行计划无法优化,具体取决于您的具体情况以及源数据的波动性。 当您执行 SELECT INTO 时,您经常会在执行计划中看到假脱机项目,这可能与读隔离有关。 如果适合您的特定情况,您可以尝试将事务隔离级别降低到成本较低的水平,和/或使用
NOLOCK
提示。 我发现,在复杂的性能关键查询中,NOLOCK
如果安全且适合您的数据,可以极大地提高查询执行速度,即使似乎没有任何理由这样做。在这种情况下,如果您尝试 READ UNCOMMITTED 或 NOLOCK 提示,您也许能够消除一些 Spool。 (显然,如果这样做可能会让您处于不一致的状态,那么您显然不想这样做,但每个人的数据隔离要求都不同)。
TOP
运算符和OR
运算符偶尔会导致假脱机,但我怀疑您是否在 ETL 过程中执行这些操作......您说得对您的 UDF 也可能是罪魁祸首。 如果您只使用每个 UDF 一次,那么尝试将它们内联起来以查看是否可以获得较大的性能优势将是一个有趣的实验。 (如果您找不到将它们与查询内联写入的方法,这可能就是它们可能导致假脱机的原因)。
我要考虑的最后一件事是,如果您正在执行任何可以重新排序的联接,请尝试使用提示来强制联接顺序按照您所知的最具选择性的顺序进行。 虽然这个范围有点大,但如果您已经陷入优化困境,尝试一下也没什么坏处。
My understanding of spooling is that it's a bit of a red herring on your execution plan. Yes, it accounts for a lot of your query cost, but it's actually an optimization that SQL Server undertakes automatically so that it can avoid costly rescanning. If you were to avoid spooling, the cost of the execution tree it sits on will go up and almost certainly the cost of the whole query would increase. I don't have any particular insight into what in particular might cause the database's query optimizer to parse the execution that way, especially without seeing the SQL code, but you're probably better off trusting its behavior.
However, that doesn't mean your execution plan can't be optimized, depending on exactly what you're up to and how volatile your source data is. When you're doing a
SELECT INTO
, you'll often see spooling items on your execution plan, and it can be related to read isolation. If it's appropriate for your particular situation, you might try just lowering the transaction isolation level to something less costly, and/or using theNOLOCK
hint. I've found in complicated performance-critical queries thatNOLOCK
, if safe and appropriate for your data, can vastly increase the speed of query execution even when there doesn't seem to be any reason it should.In this situation, if you try
READ UNCOMMITTED
or theNOLOCK
hint, you may be able to eliminate some of the Spools. (Obviously you don't want to do this if it's likely to land you in an inconsistent state, but everyone's data isolation requirements are different). TheTOP
operator and theOR
operator can occasionally cause spooling, but I doubt you're doing any of those in an ETL process...You're right in saying that your UDFs could also be the culprit. If you're only using each UDF once, it would be an interesting experiment to try putting them inline to see if you get a large performance benefit. (And if you can't figure out a way to write them inline with the query, that's probably why they might be causing spooling).
One last thing I would look at is that, if you're doing any joins that can be re-ordered, try using a hint to force the join order to happen in what you know to be the most selective order. That's a bit of a reach but it doesn't hurt to try it if you're already stuck optimizing.