Performance comparison of RBAR and set-based processing with varying transaction sizes
It is conventional wisdom that set based processing of tables should always be preferred over RBAR - especially when the tables grow larger and/or you need to update many rows.
But does that always hold? I have experienced quite a few situations - on different hardware - where set-based processing shows exponential growth in time consumption, while splitting the same workload into smaller chunks gives linear growth.
I think it would be interesting either to be proven totally wrong - if I'm missing something obvious - or, if not, to learn when splitting the workload is worth the effort, and subsequently to identify which indicators help decide which approach to use. I personally expect the following factors to be interesting:
- Size of workload
- Size and growth of the log file
- Amount of RAM
- Speed of the disk system
Any others? Number of CPUs/CPU cores?
Example 1: I have a 12 million row table, and I have to update one or two fields in each row with data from another table. If I do this in one simple UPDATE, it takes ~30 minutes on my test box. But I'll be done in ~24 minutes if I split it into twelve chunks (sketched more fully below) - i.e.:
WHERE <key> BETWEEN 0 AND 1000000
WHERE <key> BETWEEN 1000001 AND 2000000
...
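A fuller sketch of that chunking pattern might look like the following; the table, column, and field names (dbo.BigTable, dbo.OtherTable, Field1) are illustrative, and it assumes a reasonably dense integer key:

```sql
-- Chunked UPDATE: walk the key range in fixed-size slices so each
-- chunk commits as its own (smaller) transaction.
DECLARE @lo int = 0,
        @hi int = 1000000,
        @max int;

SELECT @max = MAX([key]) FROM dbo.BigTable;

WHILE @lo <= @max
BEGIN
    UPDATE t
       SET t.Field1 = s.Field1
      FROM dbo.BigTable AS t
      JOIN dbo.OtherTable AS s
        ON s.[key] = t.[key]
     WHERE t.[key] BETWEEN @lo AND @hi;

    SET @lo = @hi + 1;        -- next chunk starts after the last one
    SET @hi = @hi + 1000000;
END
```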
Example 2 is a 200+ million row table that also needs several calculations applied to practically all rows. If I do the full set all in one go, my box will run for three days and not even be done then. If I write a simple C# program to execute the exact same SQL, but with WHERE clauses appended to limit transaction size to 100k rows at a time, it's done in ~14 hours.
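For reference, the same batching idea can also be driven entirely from T-SQL rather than a C# harness; this sketch substitutes UPDATE TOP for explicit key-range WHERE clauses, and all object names (dbo.HugeTable, dbo.SourceTable, Calc) are illustrative:

```sql
-- Batched UPDATE: repeat until no row still needs the change, so each
-- iteration logs and locks at most 100k rows.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (100000) t
       SET t.Calc = s.Calc
      FROM dbo.HugeTable AS t
      JOIN dbo.SourceTable AS s
        ON s.[key] = t.[key]
     WHERE t.Calc <> s.Calc;  -- assumes non-NULL values; only rows
                              -- still needing the update qualify

    SET @rows = @@ROWCOUNT;   -- loop ends once nothing was changed
END
```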
For the record: My results are from the same databases, resting on the same physical hardware, with statistics updated, no changes in indexes, Simple recovery model, etc.
And no, I haven't tried 'true' RBAR, although I probably should - if only to see how long that would really take.
No, there is no rule that set-based is always faster. We have cursors for a reason (and don't be fooled into believing that a while loop or some other type of loop is really all that different from a cursor). Itzik Ben-Gan has demonstrated a few cases where cursors are much better, particularly for running-totals problems. There are also cases like the ones you describe, where updating 12 million rows is simply too much for SQL Server to handle as a single operation - due to memory constraints, log usage, or other reasons - without spilling to tempdb, or where it settles on a sub-optimal plan because the optimizer terminated early, not having found a more optimal plan quickly enough.
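For illustration, here is a minimal sketch of the running-totals pattern where a cursor tends to win (on versions without windowed aggregates such as SUM(...) OVER (ORDER BY ...)); the table and columns (dbo.Sales, SaleID, Amount, RunningTotal) are hypothetical:

```sql
-- Single ordered pass: each row's total is the previous total plus the
-- current amount - something set-based code of that era could only
-- express as an expensive self-join or triangular join.
DECLARE @SaleID int, @Amount money, @Total money = 0;

DECLARE c CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
    SELECT SaleID, Amount
      FROM dbo.Sales
     ORDER BY SaleID;

OPEN c;
FETCH NEXT FROM c INTO @SaleID, @Amount;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Total += @Amount;

    UPDATE dbo.Sales
       SET RunningTotal = @Total
     WHERE SaleID = @SaleID;

    FETCH NEXT FROM c INTO @SaleID, @Amount;
END

CLOSE c;
DEALLOCATE c;
```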
One of the reasons cursors get a bad rap is that people are lazy and just say:
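```sql
DECLARE c CURSOR FOR
    SELECT [key] FROM dbo.BigTable;  -- illustrative query; every cursor
                                     -- option is left at its (costly) default
```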
When they almost always should be saying:
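```sql
DECLARE c CURSOR
    LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
    SELECT [key] FROM dbo.BigTable;  -- same illustrative query
```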
This is because those extra keywords make the cursor more efficient for various reasons. Based on the documentation you would expect some of those options to be redundant, but in my testing this is not the case. See this blog post of mine and this blog post from fellow SQL Server MVP Hugo Kornelis for more details.
That all said, in most cases your best bet is going to be set-based (or at least chunky set-based, as you described above). But for one-off admin tasks (which I hope your 12-million-row update is), it is sometimes easier / more efficient to just write a cursor than to spend a lot of effort constructing an optimal query that produces an adequate plan. For queries that will run frequently as normal operations within the scope of the application, it is worth more effort to try to optimize them as set-based (keeping in mind that you may still end up with a cursor).