What can cause my code to run slower when the server JIT is activated?
I am doing some optimizations on an MPEG decoder. To ensure my optimizations aren't breaking anything, I have a test suite that benchmarks the entire codebase (both optimized and original) and verifies that both produce identical results (basically just feeding a couple of different streams through the decoder and computing a CRC32 of the outputs).
When using the "-server" option with Sun JDK 1.6.0_18, the test suite runs about 12% slower on the optimized version after warmup (in comparison to the default "-client" setting), while the original codebase gains a good boost, running about twice as fast as in client mode.
While at first this seemed to be simply a warmup issue to me, I added a loop to repeat the entire test suite multiple times. Starting at the 3rd iteration the execution time of each pass becomes constant, yet the optimized version stays 12% slower than in client mode.
I am also pretty sure it's not a garbage collection issue, since the code involves absolutely no object allocations after startup. The code consists mainly of some bit-manipulation operations (stream decoding) and lots of basic floating-point math (generating PCM audio). The only JDK classes involved are ByteArrayInputStream (which feeds the stream to the test and excludes disk IO from the measurements) and CRC32 (to verify the result). I also observed the same behaviour with Sun JDK 1.7.0_b98 (only that it's 15% instead of 12% there).
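For reference, the verification path boils down to something like the following sketch; the Decoder interface and its decode() method are hypothetical stand-ins for the actual codebase:

```java
import java.io.ByteArrayInputStream;
import java.util.zip.CRC32;

// Decoder and its decode() method are hypothetical stand-ins for the
// real codebase; decode() is assumed to return the raw PCM bytes.
interface Decoder {
    byte[] decode(ByteArrayInputStream in);
}

public class Crc32Check {
    // Decode one in-memory stream and checksum the PCM output.
    // ByteArrayInputStream keeps disk IO out of the measured path.
    static long checksumOf(byte[] mpegStream, Decoder decoder) {
        byte[] pcm = decoder.decode(new ByteArrayInputStream(mpegStream));
        CRC32 crc = new CRC32();
        crc.update(pcm, 0, pcm.length);
        return crc.getValue();
    }

    // Both codebases must produce bit-identical output for every stream.
    static void verify(byte[] stream, Decoder original, Decoder optimized) {
        if (checksumOf(stream, original) != checksumOf(stream, optimized)) {
            throw new AssertionError("optimized output differs from original");
        }
    }
}
```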
Oh, and the tests were all done on the same machine (single core) with no other applications running (WinXP). While there is some inevitable variation in the measured execution times (using System.nanoTime, btw), the variation between different test runs with the same settings never exceeded 2%, and was usually less than 1% (after warmup), so I conclude the effect is real and not purely induced by the measuring mechanism/machine.
Are there any known coding patterns that perform worse on the server JIT? Failing that, what options are available to "peek" under the hood and observe what the JIT is doing there?
Maybe I misworded my "warmup" description. There is no explicit warmup code. The entire test suite (consisting of 12 different MPEG streams, containing ~180K audio frames in total) is executed 10 times, and I regard the first 3 runs as "warmup". One test round takes approximately 40 seconds at 100% CPU on my machine.
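In sketch form, the measurement loop looks roughly like this, with runSuite() standing in for one full pass over the 12 streams:

```java
// Sketch of the timing loop described above; runSuite() is a placeholder
// for one full pass over the 12-stream test suite.
public class Bench {
    static void benchmark(Runnable runSuite) {
        final int rounds = 10;
        final int warmupRounds = 3;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            runSuite.run();
            long millis = (System.nanoTime() - start) / 1000000L;
            // Only rounds after the warmup count toward the comparison.
            System.out.println("round " + (i + 1)
                    + (i < warmupRounds ? " (warmup): " : ": ") + millis + " ms");
        }
    }
}
```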
I played with the JVM options as suggested, and with "-Xms512m -Xmx512m -Xss128k -server -XX:CompileThreshold=1 -XX:+PrintCompilation -XX:+AggressiveOpts -XX:+PrintGC" I could verify that all compilation takes place during the first 3 rounds. Garbage collection kicks in every 3-4 rounds and takes 40 ms at most (512m is extremely oversized, since the tests run just fine with 16m). From this I conclude that garbage collection has no impact here. Still, comparing client to server (other options unaltered), the 12/15% difference remains.
Answer:
As you've seen, the JIT can skew test results, since it runs in a background thread, stealing CPU cycles from your main thread running the test.

As well as stealing cycles, it's also asynchronous, so you cannot be sure it has finished its work when you complete warmup and start your test for real. To force synchronous JIT compilation, you can use the nonstandard -Xbatch option, which moves JIT compilation to the foreground thread, so you can be sure the JIT has finished when your warmup completes.

HotSpot doesn't compile methods right away, but waits until a method has been executed a certain number of times. On the page for the -XX options, it states that the default for -server is 10000 times, while for -client it is 1500 times. This could be a cause of the slowdown, particularly if your warmup ends up invoking many critical methods between 1500 and 10000 times: with the -client option they will be JITed during the warmup phase, but running with -server, compilation may be deferred into the execution of your profiled code.

You can change the number of method invocations needed before HotSpot compiles a method by setting -XX:CompileThreshold. I chose twenty, so that even vaguely hot spots (luke-warm spots?) are compiled during warmup, even when the test is run just a few times. This has worked for me in the past, but YMMV, and different values may give you better results.

You might also check the HotSpot VM Options page to find the other options that differ between -client and -server, particularly the garbage collector options, as these differ considerably.
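To watch the threshold in action, a toy program along these lines (hypothetical, not taken from the question) can be run with -XX:+PrintCompilation under both -client and -server; with a low -XX:CompileThreshold the hot method shows up in the compilation log almost immediately, while with the -server default it appears much later:

```java
// Toy example (not from the question) to observe the compile threshold.
// Run with e.g.:
//   java -server -XX:+PrintCompilation -XX:CompileThreshold=20 Threshold
// and compare when hotMethod shows up in the log against the defaults.
public class Threshold {
    static double hotMethod(double x) {
        // Enough floating-point work to make the method worth compiling.
        return Math.sqrt(x) * 0.5 + x * x;
    }

    public static void main(String[] args) {
        double acc = 0;
        for (int i = 0; i < 50000; i++) {
            acc += hotMethod(i);
        }
        // Print the result so the loop can't be eliminated as dead code.
        System.out.println(acc);
    }
}
```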