JVM crashing under stress on RHEL 5.2
I've got (the currently latest) jdk 1.6.0.18 crashing unexpectedly while running a web application on (the currently latest) tomcat 6.0.24, after anywhere from 4 hours to 8 days of stress testing (30 threads hitting the app at 6 mil. pageviews/day). This is on RHEL 5.2 (Tikanga).
The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:
- a SIGSEGV is being thrown
- on libjvm.so
- eden space is always full (100%)
JVM runs with the following options:
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"
I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.
I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
to inspect for any GC trends or space exhaustion, but there is nothing suspicious there. GC and full GC happen at predictable intervals, almost always freeing the same amount of memory.
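For reference, a minimal sketch of how these diagnostics could be folded into CATALINA_OPTS on a HotSpot 6 JVM (the flag names are standard; the log paths are just examples, not from my setup):
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true \
 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
 -Xloggc:/var/log/tomcat/gc.log \
 -XX:ErrorFile=/var/log/tomcat/hs_err_pid%p.log"
-Xloggc keeps the GC log out of catalina.out, and -XX:ErrorFile pins down where HotSpot writes the crash report, which makes it easier to line the two up after a crash.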
My application does not, directly, use any native code.
Any ideas of where I should look next?
Edit - more info:
1) There is no client vm in this JDK:
[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
2) Changing the O/S is not possible.
3) I don't want to change the JMeter stress test variables since this could hide the problem. Since I've got a use case (the current stress test scenario) which crashes the JVM I'd like to fix the crash and not change the test.
4) I've done static analysis on my application but nothing serious came up.
5) The memory does not grow over time. Memory usage settles very quickly (after startup) into a very steady trend, which does not seem suspicious.
6) /var/log/messages does not contain any useful information before or during the time of the crash.
More info: Forgot to mention that there was an Apache (2.2.14) fronting Tomcat using mod_jk 1.2.28. Right now I'm running the test without Apache, just in case the JVM crash relates to the mod_jk native code that plugs into the JVM (the Tomcat connector).
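For anyone reproducing this, a sketch of what "without Apache" means in practice (assuming the stock Tomcat 6 conf/server.xml and default ports): the stress test is pointed straight at Tomcat's own HTTP connector instead of at the AJP connector that mod_jk talks to:
<!-- plain HTTP connector: JMeter hits this directly on :8080 -->
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" />
<!-- AJP connector used by mod_jk; idle while Apache is out of the picture -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
If the crash stops happening in this setup, that points at the mod_jk/AJP path rather than at the application code.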
After that (if JVM crashes again) I'll try removing some components from my application (caching, lucene, quartz) and later on will try using jetty. Since the crash is currently happening anytime between 4 hours to 8 days, it may take a lot of time to find out what's going on.
Comments (7)
Do you have compiler output? i.e.
PrintCompilation
(and if you're feeling particularly brave, LogCompilation). I have debugged a case like this in the past by watching what the compiler was doing and, eventually (it took a long time until the light-bulb moment), realising that my crash was caused by compilation of a particular method in the Oracle JDBC driver.
Basically what I'd do is:
If there is a discernible pattern, use .hotspot_compiler (or .hotspotrc) to make the VM stop compiling the offending method(s), repeat the test and see whether it still blows up; a sketch follows. Obviously in your case this process could theoretically take months, I'm afraid.
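As a minimal sketch (the class and method here are placeholders, not anything from the crash report): compilation is made visible with
-XX:+PrintCompilation
-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=/tmp/hotspot_compilation.log
and a .hotspot_compiler file placed in the directory the JVM starts from excludes the suspect method from compilation:
exclude com/example/SuspectClass suspectMethod
The same exclusion can be passed on the command line as -XX:CompileCommand=exclude,com/example/SuspectClass,suspectMethod.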
Some references for working with the LogCompilation output:
The other thing I'd do is systematically change the GC algorithm you're using and check the crash times against GC activity (e.g. does it correlate with a young or old GC, what about TLABs?). Your dump indicates you're using parallel scavenge, so try the other collectors; some example flags are sketched below.
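A sketch of the standard alternatives on a HotSpot 6 VM, one per test run (I can't say in advance which, if any, avoids the crash):
-XX:+UseSerialGC                            # single-threaded young and old collections
-XX:+UseParallelGC -XX:+UseParallelOldGC    # parallel scavenge young gen plus parallel old gen
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC    # CMS old gen with the parallel young collector
Since the dump already shows parallel scavenge, the serial and CMS runs are the interesting ones.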
If it doesn't recur with the different GC algos then you know it's down to that (and you have no fix but to change GC algorithm and/or walk back through older JVMs until you find a version of that algo that doesn't blow up).
A few ideas:
Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere too. A timing of "between 4-24 hours" is quite spread out for a purely software issue. Although you do say the system log has no errors, so I could be way off. Still think it's worth a try.
Does your memory grow over time? If so, I suggest changing the memory limits lower to see if the system is failing more frequently when the memory is exhausted.
Can you reproduce the problem faster if:
One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use case specific. Try logging the start and stopping of use cases to see if you can determine which use cases are more likely to cause the problem. If you partition your use cases in half, see which half fails the fastest. That is likely to be a more frequent cause of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.
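One low-effort way to get that start/stop record (a sketch, assuming stock Tomcat 6; the pattern is only an example) is an access-log valve with per-request timings in conf/server.xml, whose timestamps can then be lined up against the hs_err file:
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="access_log." suffix=".txt"
       pattern="%t %r %s %b %D" />
%D records the processing time in milliseconds; the requests logged just before the crash (and those that never got an entry because they were still in flight) narrow down which use cases to bisect first.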
I have also been known to either change the server to do little work or loop on the work that the server is doing. One makes your application code work a lot harder, the other makes the web server and application server work a lot harder.
Good luck,
Jacob
Try switching your servlet container from Tomcat to Jetty http://jetty.codehaus.org/jetty/.
If I were you, I'd do the following:
Let us know how this was sorted out!
Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.