JVM crashing under stress on RHEL 5.2

Posted 2024-08-21 09:18:06

I've got (the currently latest) JDK 1.6.0_18 crashing unexpectedly while running a web application on (the currently latest) Tomcat 6.0.24, after 4 hours to 8 days of stress testing (30 threads hitting the app at 6 mil. pageviews/day). This is on RHEL 5.2 (Tikanga).

The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:

  • a SIGSEGV is being thrown
  • on libjvm.so
  • eden space is always full (100%)

JVM runs with the following options:

CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"

I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.

I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to inspect for any GC trends or space exhaustion, but there is nothing suspicious there. GC and full GC happen at predictable intervals, almost always freeing the same amount of memory.
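For reference, this is roughly how those flags can be wired into the Tomcat startup (e.g. in setenv.sh or wherever CATALINA_OPTS is set); the -Xloggc path is just an example, but sending the GC log to its own file makes it easier to line it up against the hs_err report after a crash:

# Sketch only: base options as above, plus GC logging to a dedicated file.
# The log path is an example, not the poster's actual setup.
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/path/to/gc.log"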

My application does not directly use any native code.

Any ideas of where I should look next?

Edit - more info:

1) There is no client vm in this JDK:

[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

2) Changing the O/S is not possible.

3) I don't want to change the JMeter stress test variables since this could hide the problem. Since I've got a use case (the current stress test scenario) which crashes the JVM I'd like to fix the crash and not change the test.

4) I've done static analysis on my application but nothing serious came up.

5) The memory does not grow over time. The memory usage equilibrates very quickly (after startup) at a very steady trend which does not seem suspicious.

6) /var/log/messages does not contain any useful information before or during the time of the crash.

More info: Forgot to mention that there was an Apache (2.2.14) fronting Tomcat using mod_jk 1.2.28. Right now I'm running the test without Apache, just in case the JVM crash relates to the mod_jk native code which connects to the JVM (the Tomcat connector).

After that (if JVM crashes again) I'll try removing some components from my application (caching, lucene, quartz) and later on will try using jetty. Since the crash is currently happening anytime between 4 hours to 8 days, it may take a lot of time to find out what's going on.

7 Answers

音盲 2024-08-28 09:18:06

Do you have compiler output? i.e. PrintCompilation (and if you're feeling particularly brave, LogCompilation).

I have debugged a case like this in the past by watching what the compiler was doing and, eventually (it took a long time until the light-bulb moment), realising that my crash was caused by compilation of a particular method in the Oracle JDBC driver.

Basically what I'd do is:

  • switch on PrintCompilation
  • since that doesn't give timestamps, write a script that watches that logfile (e.g. sleep every second and print new rows) and reports when methods were compiled (or not); a sketch of such a watcher follows this list
  • repeat the test
  • check the compiler output to see if the crash corresponds with compilation of some method
  • repeat a few more times to see if there is a pattern
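
A rough sketch of such a watcher, assuming the PrintCompilation output ends up in catalina.out (the path is just an example; point it at wherever the JVM's stdout actually goes):

# Prefix every new PrintCompilation line with a wall-clock timestamp so a crash
# can later be matched against whatever the JIT was compiling at the time.
tail -F /path/to/catalina.out | while read -r line; do
  echo "$(date '+%Y-%m-%d %H:%M:%S')  $line"
done >> compilation-timestamps.log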

If there is a discernible pattern, then use .hotspot_compiler (or .hotspotrc) to make it stop compiling the offending method(s), repeat the test and see if it doesn't blow up. Obviously in your case this process could theoretically take months, I'm afraid.
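
The exclusion file is just one "exclude <class> <method>" directive per line, sitting in the directory the JVM is started from; the class and method below are purely hypothetical placeholders:

# Create .hotspot_compiler in the JVM's working directory (e.g. wherever
# catalina.sh launches the process); the class/method are made-up examples.
cat > .hotspot_compiler <<'EOF'
exclude com/example/dao/OrderDao findOrders
EOF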

Some references for dealing with the LogCompilation output.

The other thing I'd do is systematically change the GC algorithm you're using and check the crash times against GC activity (e.g. does it correlate with a young or an old GC? what about TLABs?). Your dump indicates you're using Parallel Scavenge, so try:

  • the serial (young) collector (IIRC it can be combined with a parallel old)
  • ParNew + CMS
  • G1

If it doesn't recur with the different GC algorithms then you know it's down to that (and you have no fix but to change the GC algorithm and/or walk back through older JVMs until you find a version of that algorithm that doesn't blow up).
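
For reference, the switches for those collector combinations on a 6u18 HotSpot look roughly like this; flag availability can vary by build, and G1 was still experimental in that release, so treat these as starting points rather than exact recipes:

# Use exactly one of these per test run (base options kept from above):
# 1) serial young + serial old collector
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true -XX:+UseSerialGC"
# 2) ParNew + CMS
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
# 3) G1 (still experimental in 6u18, hence the unlock flag)
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC"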

薄荷梦 2024-08-28 09:18:06

A few ideas:

  • Use a different JDK, Tomcat and/or OS version
  • Slightly modify test parameters, e.g. 25 threads at 7.2 M pageviews/day
  • Monitor or profile memory usage (e.g. with jstat; a sketch follows this list)
  • Debug or tune the Garbage Collector
  • Run static and dynamic analysis
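
For the monitoring point, something as lightweight as jstat can show heap and GC behaviour over time without attaching a profiler; the PID lookup below is just one way to find the Tomcat process and assumes a single Tomcat JVM:

# Log heap/GC utilisation of the Tomcat JVM every 10 seconds.
PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)
jstat -gcutil "$PID" 10000 >> jstat-gcutil.log
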
梨涡少年 2024-08-28 09:18:06

Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere too. A timing of "between 4-24 hours" is quite spread out for a purely software issue. Although you do say the system log has no errors, so I could be way off. Still think it's worth a try.

叹沉浮 2024-08-28 09:18:06

Does your memory grow over time? If so, I suggest lowering the memory limits to see if the system fails more frequently when memory is exhausted.

Can you reproduce the problem faster if:

  • You decrease the memory available to the JVM? (a sketch follows this list)
  • You decrease the available system resources (i.e. drain system memory so JVM does not have enough)
  • You change your use cases to a simpler model?
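
For the first point, the simplest experiment is probably to shrink the heap in the existing options and rerun the same JMeter scenario; the sizes below are arbitrary examples, not recommendations:

# Same stress test, smaller heap, to see whether the crash tracks memory pressure.
CATALINA_OPTS="-server -Xms128m -Xmx256m -Djava.awt.headless=true"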

One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use case specific. Try logging the start and stopping of use cases to see if you can determine which use cases are more likely to cause the problem. If you partition your use cases in half, see which half fails the fastest. That is likely to be a more frequent cause of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.

I have also been known to either change the server to do little work or loop on the work that the server is doing. One makes your application code work a lot harder, the other makes the web server and application server work a lot harder.

Good luck,
Jacob

冬天的雪花 2024-08-28 09:18:06

Try switching your servlet container from Tomcat to Jetty http://jetty.codehaus.org/jetty/.

你的背包 2024-08-28 09:18:06

If I was you, I'd do the following:

  • try slightly older Tomcat/JVM versions. You seem to be running the newest and greatest. I'd go down two versions or so, possibly try JRockit JVM.
  • do a thread dump (kill -3 java_pid) while the app is running to see the full stacks. Your current dump shows lots of threads being blocked - but it is not clear where they block (I/O? some internal lock starvation? anything else?). I'd even maybe schedule kill -3 to run every minute (a sketch follows this list) to compare any random thread dump with the one just before the crash.
  • I have seen cases where the Linux JDK just dies whereas the Windows JDK is able to gracefully catch an exception (it was a StackOverflowError in that case), so if you can modify the code, add "catch Throwable" somewhere in the top-level class. Just in case.
  • Play with GC tuning options. Turn concurrent GC on/off, adjust NewSize/MaxNewSize. And yes, this is not scientific - rather a desperate search for a working solution. More details here: http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
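
For the scheduled thread dumps, a throwaway loop like the one below is usually enough; the PID lookup is just an example, and on a stock Tomcat setup the dumps land in catalina.out:

# Take a thread dump every minute until the JVM exits or crashes; kill -3 fails
# once the process is gone, which ends the loop.
PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)
while kill -3 "$PID" 2>/dev/null; do
  sleep 60
done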

Let us know how this was sorted out!

爱你是孤单的心事 2024-08-28 09:18:06

Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.
