JVM crashes under stress on RHEL 5.2

Posted on 2024-08-21 09:18:06

I've got (the currently latest) jdk 1.6.0.18 crashing unexpectedly while running a web application on (the currently latest) tomcat 6.0.24, after anywhere from 4 hours to 8 days of stress testing (30 threads hitting the app at 6 mil. pageviews/day). This is on RHEL 5.2 (Tikanga).

The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:

  • a SIGSEGV is being thrown
  • on libjvm.so
  • eden space is always full (100%)

JVM runs with the following options:

CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"

I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.

I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to inspect for any GC trends or space exhaustion, but there is nothing suspicious there. GC and full GC happen at predictable intervals, almost always freeing the same amount of memory.
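For what it's worth, those flags can also be pointed at dedicated files so the GC history and any future hs_err report survive restarts and log rotation; a minimal sketch, with example paths:

CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat/gc.log -XX:ErrorFile=/var/log/tomcat/hs_err_%p.log"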

My application does not, directly, use any native code.

Any ideas of where I should look next?

Edit - more info:

1) There is no client vm in this JDK:

[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

2) Changing the O/S is not possible.

3) I don't want to change the JMeter stress test variables since this could hide the problem. Since I've got a use case (the current stress test scenario) which crashes the JVM I'd like to fix the crash and not change the test.

4) I've done static analysis on my application but nothing serious came up.

5) The memory does not grow over time. The memory usage equilibrates very quickly (after startup) at a very steady trend which does not seem suspicious.

6) /var/log/messages does not contain any useful information before or during the time of the crash.

More info: Forgot to mention that there was an apache (2.2.14) fronting tomcat using mod_jk 1.2.28. Right now I'm running the test without apache, just in case the JVM crash relates to the mod_jk native code which connects to the JVM (tomcat connector).

After that (if JVM crashes again) I'll try removing some components from my application (caching, lucene, quartz) and later on will try using jetty. Since the crash is currently happening anytime between 4 hours to 8 days, it may take a lot of time to find out what's going on.

Comments (7)

音盲 2024-08-28 09:18:06

Do you have compiler output? i.e. PrintCompilation (and if you're feeling particularly brave, LogCompilation).

I have debugged a case like this in the past by watching what the compiler is doing and, eventually (this took a long time until the light bulb moment), realising that my crash was caused by compilation of a particular method in the oracle jdbc driver.

Basically what I'd do is:

  • switch on PrintCompilation
  • since that doesn't give timestamps, write a script that watches that logfile (e.g. sleep a second, then print any new rows) and reports when methods were compiled (or not); a sketch of such a watcher follows this list
  • repeat the test
  • check the compiler output to see if the crash corresponds with compilation of some method
  • repeat a few more times to see if there is a pattern
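A minimal sketch of that watcher, assuming the PrintCompilation output ends up in Tomcat's catalina.out (the path is just an example); it prefixes each new line with a wall-clock timestamp so compilations can be lined up against the crash time:

tail -F logs/catalina.out | while read -r line; do echo "$(date '+%F %T') $line"; done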

If there is a discernible pattern then use .hotspot_compiler (or .hotspotrc) to make it stop compiling the offending method(s), repeat the test and see if it no longer blows up. Obviously in your case this process could theoretically take months, I'm afraid.
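For reference, a .hotspot_compiler file (placed in the JVM's working directory) takes one directive per line; the class and method below are hypothetical placeholders, not a known culprit:

exclude com/example/dao/SomeDao executeQuery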

There are some references around for dealing with LogCompilation output.

The other thing I'd do is systematically change the gc algorithm you're using and check the crash times against gc activity (e.g. does it correlate with a young or old gc, what about TLABs?). Your dump indicates you're using parallel scavenge, so try the following (the corresponding flags are sketched at the end of this answer):

  • the serial (young) collector (IIRC it can be combined with a parallel old)
  • ParNew + CMS
  • G1

If it doesn't recur with the different GC algos then you know it's down to that (and you have no fix but to change GC algo and/or walk back through older JVMs until you find a version of that algo that doesn't blow up).
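For reference, the collectors listed above roughly correspond to these flags on a 6u18 HotSpot (G1 was still experimental in that release, hence the unlock flag):

-XX:+UseSerialGC
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC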

薄荷梦 2024-08-28 09:18:06

A few ideas:

  • Use a different JDK, Tomcat and/or OS version
  • Slightly modify test parameters, e.g. 25 threads at 7.2 M pageviews/day
  • Monitor or profile memory usage (e.g. with jstat; see the example after this list)
  • Debug or tune the Garbage Collector
  • Run static and dynamic analysis
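For the monitoring point, a lightweight option is jstat against the running Tomcat process (sampling every 5 seconds here; the PID lookup is just an example, assuming the standard Bootstrap main class):

jstat -gcutil $(pgrep -f org.apache.catalina.startup.Bootstrap) 5000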

梨涡少年 2024-08-28 09:18:06

Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere too. A timing of "between 4-24 hours" is quite spread out for it to be just a software issue. Although you do say the system log has no errors, so I could be way off. Still think it's worth a try.

叹沉浮 2024-08-28 09:18:06

Does your memory grow over time? If so, I suggest lowering the memory limits to see if the system fails more frequently when memory is exhausted.
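A concrete sketch of that idea, keeping the question's other options and only shrinking the heap (the values are arbitrary examples):

CATALINA_OPTS="-server -Xms128m -Xmx256m -Djava.awt.headless=true"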

Can you reproduce the problem faster if:

  • You decrease the memory available to the JVM?
  • You decrease the available system resources (i.e. drain system memory so the JVM does not have enough)?
  • You change your use cases to a simpler model?

One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use case specific. Try logging the start and stopping of use cases to see if you can determine which use cases are more likely to cause the problem. If you partition your use cases in half, see which half fails the fastest. That is likely to be a more frequent cause of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.
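One low-effort way to get such a per-request trail without touching application code is Tomcat's access log valve in server.xml; a sketch (directory and prefix are examples; %D is the request duration in milliseconds, from which start and end times can be derived):

<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="stress_access." suffix=".log"
       pattern="%t %m %U %s %D" />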

I have also been known to either change the server to do little work or loop on the work that the server is doing. One makes your application code work a lot harder, the other makes the web server and application server work a lot harder.

Good luck,
Jacob

冬天的雪花 2024-08-28 09:18:06

Try switching your servlet container from Tomcat to Jetty http://jetty.codehaus.org/jetty/.

你的背包 2024-08-28 09:18:06

If I was you, I'd do the following:

  • try slightly older Tomcat/JVM versions. You seem to be running the newest and greatest. I'd go down two versions or so, possibly try JRockit JVM.
  • do a thread dump (kill -3 java_pid) while the app is running to see the full stacks. Your current dump shows lots of threads being blocked - but it is not clear where they block (I/O? some internal lock starvation? anything else?). I'd even maybe schedule kill -3 to be run every minute to compare any random thread dump with the one just before the crash (see the one-liner after this list).
  • I have seen cases where Linux JDK just dies whereas Windows JDK is able to gracefully catch an exception (was StackOverflowException then), so if you can modify the code, add "catch Throwable" somewhere in the top class. Just in case.
  • Play with GC tuning options. Turn concurrent GC on/off, adjust NewSize/MaxNewSize. And yes, this is not scientific - rather a desperate need for a working solution. More details here: http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
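A rough one-liner for the periodic thread dumps mentioned above (assumes a single Tomcat process with the standard Bootstrap main class; the dumps land in catalina.out):

while true; do kill -3 "$(pgrep -f org.apache.catalina.startup.Bootstrap)"; sleep 60; done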

Let us know how this was sorted out!

爱你是孤单的心事 2024-08-28 09:18:06

Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.
