Can profile-guided optimization done by the compiler noticeably hurt cases not covered by the profiling dataset?

Posted 2024-12-10 20:01


This question is not specific to C++. AFAIK, certain runtimes such as the Java RE can do profile-guided optimization on the fly, and I'm interested in that too.

MSDN describes PGO like this:

  1. I instrument my program and run it under a profiler, then
  2. the compiler uses the data gathered by the profiler to automatically reorganize branches and loops in such a way that branch misprediction is reduced and the most frequently run code is placed compactly to improve its locality
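
To make that workflow concrete, here is a toy program with a hedged sketch of the MSVC command sequence in the comments (flag names as documented for recent MSVC versions; older toolsets spell them differently, so check the docs for yours):

    // toy.cpp - a hot/cold branch whose layout and prediction PGO can tune.
    //
    // Sketch of the MSVC PGO workflow (assumed from the MSVC docs):
    //   cl /GL toy.cpp /Fe:toy.exe /link /LTCG /GENPROFILE   (instrumented build)
    //   toy.exe 1000000                                      (training run, writes *.pgc counts)
    //   cl /GL toy.cpp /Fe:toy.exe /link /LTCG /USEPROFILE   (optimized rebuild)
    #include <cstdio>
    #include <cstdlib>

    int hot(int x)  { return x * 3 + 1; }  // taken ~99% of the time in training
    int cold(int x) { return x - 7; }      // rarely taken; may be laid out cold

    int main(int argc, char** argv) {
        long n = argc > 1 ? std::strtol(argv[1], nullptr, 10) : 1000;
        long sum = 0;
        for (long i = 0; i < n; ++i)
            sum += (i % 100 != 0) ? hot(static_cast<int>(i)) : cold(static_cast<int>(i));
        std::printf("%ld\n", sum);
        return 0;
    }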

Now, obviously, the profiling result will depend on the dataset used.

With normal manual profiling and optimization I'd find some bottlenecks, improve them, and likely leave all the other code untouched. PGO seems to improve often-run code at the expense of making rarely-run code slower.

Now what if that slowed-down code is run often on another dataset that the program sees in the real world? Will the program's performance degrade compared to a program compiled without PGO, and how bad is the degradation likely to be? In other words, does PGO really improve my code's performance for the profiling dataset while possibly worsening it for other datasets? Are there any real examples with real data?


Comments (2)

瑾兮 2024-12-17 20:01:00


Disclaimer: I have not done more with PGO than read up on it and tried it once with a sample project for fun. A lot of the following is based on my experience with the "non-PGO" optimizations and educated guesses. TL;DR below.

This page lists the optimizations done by PGO. Let's look at them one by one (grouped by impact):

Inlining - For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Register Allocation - Optimizing with profile data results in better register allocation.

Virtual Call Speculation - If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.

These apparently improve the prediction of whether or not certain optimizations pay off. There is no direct tradeoff for non-profiled code paths.
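
To illustrate, virtual call speculation roughly corresponds to this hand-written transformation (illustrative only; the compiler performs it on the generated code, guarding on the vtable pointer rather than with dynamic_cast, and the type names here are made up):

    struct Shape {
        virtual double area() const = 0;
        virtual ~Shape() = default;
    };

    struct Circle : Shape {
        double r;
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.14159265358979 * r * r; }
    };

    // Profiling showed that s almost always points at a Circle, so a guarded
    // direct (and inlinable) call is inserted before the generic virtual call.
    double speculated_area(const Shape* s) {
        if (auto c = dynamic_cast<const Circle*>(s))  // guard on the hot type
            return 3.14159265358979 * c->r * c->r;    // inlined Circle::area body
        return s->area();                             // fallback: virtual dispatch
    }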


Basic Block Optimization - Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Function Layout - Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.

Dead Code Separation - Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.

EH Code Separation - The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.

All of this may reduce the locality of non-profiled code paths. In my experience, the impact would be noticeable or severe if such a code path has a tight loop that exceeds the L1 code cache (and maybe even thrashes L2). That sounds exactly like a path that should have been included in a PGO profile :)

Dead Code separation can have a huge impact - both ways - because it can reduce disk access.

If you rely on exceptions being fast, you are doing it wrong.
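
For comparison, the closest manual analogue to dead-code/EH separation is hinting cold paths yourself; here is a minimal sketch using the standard C++20 [[unlikely]] attribute (PGO derives the same information from the profile without any annotations):

    #include <stdexcept>

    int checked_divide(int a, int b) {
        if (b == 0) [[unlikely]] {  // error path hinted cold, so the compiler may
                                    // move it out of the hot code region
            throw std::invalid_argument("division by zero");
        }
        return a / b;               // hot path stays compact and fall-through
    }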


Size/Speed Optimization - Functions where the program spends a lot of time can be optimized for speed.

The rule of thumb nowadays is to "optimize for size by default, and only optimize for speed where needed (and verify that it helps)". The reason is again the code cache: in most cases, smaller code will also be faster code. So this automates, to a degree, what you should do manually. Compared to a global speed optimization, this would slow down non-profiled code paths only in very atypical cases ("weird code" or a target machine with unusual cache behavior).
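
MSVC's manual counterpart to this per-function size/speed decision is the optimize pragma; a sketch, assuming the rest of the translation unit is compiled with /O1 (favor size):

    // Built with /O1 overall; only the hot kernel is tuned for speed.
    #pragma optimize("t", on)      // MSVC: favor fast code from here on
    double hot_kernel(const double* v, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += v[i] * v[i];
        return s;
    }
    #pragma optimize("", on)       // restore the command-line settings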


Conditional Branch Optimization - With value probes, profile-guided optimizations can find whether a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions, where the optimizer can order the if/else so that either the if or the else block is placed first, depending on which block is more frequently true.

I would file that under "improved prediction", too, unless you feed PGO the wrong information.

The typical case where this can pay off a lot is run-time parameter/range validation and similar paths that should never be taken in a normal execution.
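
Written out by hand, the switch transformation described above looks roughly like this (illustrative only; PGO applies it to the generated code after the profile shows one value dominating, and the enum here is made up):

    enum Op { ADD, SUB, MUL, DIV };

    int eval(Op opcode, int a, int b) {
        if (opcode == ADD)            // hoisted hot case: one well-predicted compare
            return a + b;
        switch (opcode) {             // cold remainder keeps the original structure
            case SUB: return a - b;
            case MUL: return a * b;
            case DIV: return b != 0 ? a / b : 0;
            default:  return a + b;   // ADD handled above; kept for completeness
        }
    }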

The breaking case would be:

if (x > 0) DoThis(); else DoThat();

in a relevant tight loop and profiling only the x > 0 case.


Memory Intrinsics - The expansion of intrinsics can be decided better if it can be determined whether an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

Again, mostly better information, with a small possibility of penalizing untested data.

Example - this is all an "educated guess", but I think it's quite illustrative for the entire topic.

Assume you have a memmove that is always called on well-aligned, non-overlapping buffers with a length of 16 bytes.

A possible optimization is to verify these conditions and use inlined MOV instructions for this case, calling the general memmove (which handles alignment, overlap, and odd lengths) only when the conditions are not met.

The benefits can be significant in a tight loop that copies structs around, as you improve locality and reduce expected-path instructions, likely with more chances for pairing/reordering.

The penalty is comparatively small, though: in the general case without PGO, you would either always call the full memmove or inline the full memmove implementation. The optimization adds a few instructions (including a conditional jump) to something rather complex; I'd assume a 10% overhead at most. In most cases, these 10% will be below the noise from cache accesses.

However, there is a very slight chance of significant impact if the unexpected branch is taken frequently and the additional instructions for the expected case, together with the instructions for the default case, push a tight loop out of the L1 code cache.

Note that you are already at the limits of what the compiler could do for you. The additional instructions can be expected to be a few bytes, compared to a few KB of code cache. A static optimizer could suffer the same fate, depending on how well it can hoist invariants - and how much you let it.
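
A hand-written version of that educated guess might look like the following sketch (purely illustrative; the compiler would emit this inline at the call site rather than as a wrapper function):

    #include <cstring>
    #include <cstdint>

    // Fast path for the profiled common case - 16 bytes, 16-byte aligned,
    // non-overlapping - with a fallback to the general memmove.
    void* memmove_specialized(void* dst, const void* src, std::size_t n) {
        const bool aligned  = (reinterpret_cast<std::uintptr_t>(dst) % 16 == 0) &&
                              (reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
        const char* s = static_cast<const char*>(src);
        const char* d = static_cast<const char*>(dst);
        const bool disjoint = (s + n <= d) || (d + n <= s);
        if (n == 16 && aligned && disjoint) {
            std::uint64_t lo, hi;
            std::memcpy(&lo, s, 8);       // fixed-size memcpy compiles to plain
            std::memcpy(&hi, s + 8, 8);   // MOVs - no library call
            std::memcpy(dst, &lo, 8);
            std::memcpy(static_cast<char*>(dst) + 8, &hi, 8);
            return dst;
        }
        return std::memmove(dst, src, n); // general case: overlap, alignment, odd length
    }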


Conclusion:

  • Many of the optimizations are neutral.
  • Some optimizations can have a slight negative impact on non-profiled code paths.
  • The impact is usually much smaller than the possible gains.
  • Very rarely, a small impact can be amplified by other contributing pathological factors.
  • A few optimizations (namely, the layout of code sections) can have a large impact, but once again the possible gains significantly outweigh it.

My gut feel would further claim that

  • A static optimizer, on the whole, would be at least equally likely to create a pathological case,
  • It would be pretty hard to actually destroy performance even with bad PGO input.

At that level, I would be much more afraid of PGO implementation bugs/shortcomings than of failed PGO optimizations.

菩提树下叶撕阳。 2024-12-17 20:01:00


PGO can most certainly affect the run time of code that runs less frequently. After all, you are modifying the locality of some functions/blocks, and that will make the blocks that are now placed together more cache-friendly.

What I have seen is that teams identify their high-priority scenarios. Then they run those to train the optimization profiler and measure the improvement. You don't want to run all scenarios under PGO, because if you do, you might as well not run any.

As with everything related to performance, you need to measure before you apply it. Measure your most common scenarios to see whether they improved at all by using PGO training, and also measure the less common scenarios to see whether they regressed at all.
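
A minimal sketch of such a measurement harness (the scenario bodies and repetition count are placeholders for your real workloads):

    #include <chrono>
    #include <cstdio>

    // Time one scenario; run this against both the PGO and the non-PGO build,
    // for the common scenarios as well as the rare ones.
    template <typename F>
    double seconds_per_run(F&& scenario, int reps = 100) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i) scenario();
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count() / reps;
    }

    int main() {
        std::printf("common: %.6f s\n", seconds_per_run([] { /* common-case workload */ }));
        std::printf("rare:   %.6f s\n", seconds_per_run([] { /* rare-case workload */ }));
        return 0;
    }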
