Sporadic behavior of a machine under stress
We are running Java stress tests that involve network I/O. Initially everything is fine and the system responds very quickly (average latency in the test is 2 ms). But when I rerun the same test hours later, performance degrades (20-60 ms). The JAR files, the JVM, and the LAN the stress test runs over are all the same. I don't understand the reason for this behavior.
The LAN is 1 Gbps, and given the stress requirements I'm sure we are not using all of it.
So my questions:
- Could this be caused by switches in the LAN?
- Does a machine slow down after running for some time? (The machines were last restarted about 6 months ago, well before the stress tests started. They are RHEL5, 64-bit quad-core Xeon.)
- What is the general way to debug such issues?
Comments (2)
A few questions...
How much of the environment is under your control, and are you putting any measures in place to ensure it's consistent for each run? That is, are you sharing the network with other systems, and is the machine you're using dedicated solely to your stress testing?
The way I'd look at this is to start gathering details on what your machine and code are up to. That means using perfmon (Windows) or sar (Unix) to find out what the OS and hardware are doing, and attaching a profiler to confirm your code is doing the same thing on every run and to help pinpoint where the bottleneck occurs from a code perspective.
Nothing terribly detailed, but I hope it helps get you started.
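To complement OS-level tools such as sar, here is a minimal sketch of what "gathering details" could look like from inside the JVM. It uses only the standard `java.lang.management` API. The class name `EnvSnapshot`, its method names, and the output format are hypothetical choices for illustration, not something from the original setup. The idea is to print one line per sample during both a fast run and a slow run and then compare them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;

// Hypothetical helper: emits one line of JVM/OS metrics per sample so that
// a fast run and a slow run can be compared afterwards.
class EnvSnapshot {

    // Returns {total GC count, total GC millis} summed over all collectors
    // since JVM start.
    static long[] gcTotals() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when a collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, millis};
    }

    // Formats a single sample line with a wall-clock timestamp.
    static String sample() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long[] gc = gcTotals();
        return String.format(Locale.ROOT,
                "ts=%d load=%.2f heapUsedMB=%d gcCount=%d gcMillis=%d threads=%d",
                System.currentTimeMillis(),
                os.getSystemLoadAverage(),   // negative if the platform cannot report it
                heap.getUsed() / (1024 * 1024),
                gc[0], gc[1],
                ManagementFactory.getThreadMXBean().getThreadCount());
    }

    public static void main(String[] args) throws InterruptedException {
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        for (int i = 0; i < samples; i++) {
            System.out.println(sample());
            Thread.sleep(1000);
        }
    }
}
```

Note that these beans describe the JVM they run in, so for real use `sample()` would have to be called from a background thread inside the application under test rather than from a separate process. A steadily growing `gcMillis` or heap between the fast and slow runs would point at the JVM; flat JVM numbers with a high load average would point at the machine or at something else sharing it.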
The general way is "measure everything". This, in particular, might mean:
You can probably start from the 5th element, since this is (you believe) your critical chain. But it is best to log as much as you can: by your own account, it takes days to produce different results.
If you don't want to modify your code, look for cases where you can sniff data without intervening (e.g. define a servlet filter in your web.xml).
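A servlet filter needs the servlet API on the classpath, so the sketch below swaps in a different technique with the same intent: a JDK dynamic proxy that times every call made through an interface without touching the implementation. The class `TimingProxy` and its methods `as` and `samples` are hypothetical names made up for this example, and it assumes the code under test is reached through an interface.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical wrapper: records the elapsed time of every interface call
// and forwards the call unchanged to the real object.
class TimingProxy implements InvocationHandler {
    private final Object target;
    // Elapsed nanoseconds, one entry per intercepted call.
    private final List<Long> nanos = Collections.synchronizedList(new ArrayList<Long>());

    TimingProxy(Object target) {
        this.target = target;
    }

    // Creates a proxy implementing iface; the interface must be visible
    // to the class loader that loaded TimingProxy.
    <T> T as(Class<T> iface) {
        Object proxy = Proxy.newProxyInstance(
                TimingProxy.class.getClassLoader(), new Class<?>[] {iface}, this);
        return iface.cast(proxy);
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        long start = System.nanoTime();
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the target's own exception
        } finally {
            nanos.add(System.nanoTime() - start);
        }
    }

    // Returns a copy of the recorded timings.
    List<Long> samples() {
        synchronized (nanos) {
            return new ArrayList<Long>(nanos);
        }
    }

    public static void main(String[] args) {
        Supplier<String> real = () -> "pong";   // stand-in for the real service
        TimingProxy timer = new TimingProxy(real);
        Supplier<?> timed = timer.as(Supplier.class);
        for (int i = 0; i < 3; i++) {
            timed.get();
        }
        for (long n : timer.samples()) {
            System.out.println("call took " + n + " ns");
        }
    }
}
```

In a long stress run the samples would be written out with timestamps rather than held in memory, so that the 2 ms period and the 20-60 ms period can be lined up against the OS-level measurements.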