Sporadic behavior of machines under stress

Posted 2024-08-23 00:05:25

We are running some Java stress tests that involve network I/O. Initially everything is fine and the system responds very quickly (average latency in the test is 2 ms). But hours later, when I rerun the same test, performance has degraded (20 to 60 ms). The stress run uses the same JAR files, the same JVM, and the same LAN. I don't understand the reason for this behavior.

The LAN is 1 Gbps, and given the stress requirements I'm sure we are not using all of it.

So my questions:

  1. Could it be caused by some of the switches in the LAN?
  2. Do machines slow down after some time? (The machines were last restarted about 6 months ago, well before the stress runs started. They are RHEL5, 64-bit quad-core Xeon.)
  3. What is the general way to debug this kind of issue?

Comments (2)

闻呓 2024-08-30 00:05:25

A few questions...

How much of the environment is under your control, and are you putting any measures in place to ensure it's consistent for each run? For example, are you sharing the network with other systems, and is the machine you're using dedicated to your stress testing?

The way I'd look at this is to start gathering details on what your machine and code are up to. That means using perfmon (Windows) or sar (Unix) to find out what the OS and hardware are doing, and attaching a profiler to confirm that your code is doing the same thing on every run and to help pinpoint where the bottleneck is occurring from a code perspective.
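As a complement to sar and a profiler, the JVM can log a few of its own numbers during the run, so a fast run and a slow run can be compared afterwards. The following is only a minimal sketch using the JDK's standard `java.lang.management` beans; the class name `JvmSnapshot`, the output format, and the sampling interval are assumptions made up for illustration, not something from the question.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;

// Hypothetical helper: one line of OS and JVM state per sample.
final class JvmSnapshot {

    /** Returns a one-line summary of system load, heap use, and cumulative GC activity. */
    static String sample() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        long gcCount = 0;
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcMillis += Math.max(0, gc.getCollectionTime());
        }

        return String.format(Locale.ROOT,
                "ts=%d load=%.2f heapUsedMB=%d gcCount=%d gcMillis=%d",
                System.currentTimeMillis(),
                os.getSystemLoadAverage(),      // negative if unavailable on this platform
                heap.getUsed() / (1024 * 1024),
                gcCount,
                gcMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples; during a real stress run this would log for the whole test.
        for (int i = 0; i < 3; i++) {
            System.out.println(sample());
            Thread.sleep(500);
        }
    }
}
```

If `gcMillis` grows much faster during the slow run than during the fast one, garbage collection is worth a closer look; if it does not, the OS-level numbers from sar are the next place to compare.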

Nothing terribly detailed, but I hope it helps get you started.

对岸观火 2024-08-30 00:05:25

The general way is "measure everything". In particular, this might mean:

  1. Ensure the time on all servers is the same (use NTP or something similar).
  2. Measure how long it takes to generate a request (what if the request generator has a bug?).
  3. Measure when the request leaves the client machine(s), or at least how long the I/O takes. Sometimes it is enough to know the average time over many requests.
  4. Measure when the request arrives.
  5. Measure how long it takes to generate a response.
  6. Measure how long it takes to send the response.

You can probably start with item 5, since that is (you believe) your critical path. But it is best to log as much as you can, because by your own account it can take hours or days before the results change.
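The per-stage measurements listed above can be collected with very little code. Below is a rough sketch, assuming Java 8 or later; the class `StageTimer` and the stage names are hypothetical, and a real harness would probably write the numbers to a log file instead of keeping them all in memory. Note that `System.nanoTime()` is only meaningful within a single JVM, so comparing "left the client" with "arrived at the server" still needs wall-clock timestamps on NTP-synchronized machines (item 1).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: collects durations per named stage and summarizes them.
final class StageTimer {
    // Stage name -> recorded durations in nanoseconds, in first-seen order.
    private final Map<String, List<Long>> samples = new LinkedHashMap<>();

    /** Records one duration for a stage, e.g. "generate-response". */
    synchronized void record(String stage, long nanos) {
        samples.computeIfAbsent(stage, k -> new ArrayList<>()).add(nanos);
    }

    /** Runs the task and records how long it took under the given stage name. */
    void time(String stage, Runnable task) {
        long start = System.nanoTime(); // monotonic, valid only inside this JVM
        try {
            task.run();
        } finally {
            record(stage, System.nanoTime() - start);
        }
    }

    /** Average duration in milliseconds, or NaN if the stage has no samples. */
    synchronized double averageMillis(String stage) {
        List<Long> recorded = samples.get(stage);
        if (recorded == null || recorded.isEmpty()) return Double.NaN;
        long sum = 0;
        for (long n : recorded) sum += n;
        return (double) sum / recorded.size() / 1_000_000.0;
    }

    /** Nearest-rank percentile (p in (0, 100]) in milliseconds, or NaN if no samples. */
    synchronized double percentileMillis(String stage, double p) {
        List<Long> recorded = samples.get(stage);
        if (recorded == null || recorded.isEmpty()) return Double.NaN;
        List<Long> sorted = new ArrayList<>(recorded);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        int index = Math.min(sorted.size() - 1, Math.max(0, rank - 1));
        return sorted.get(index) / 1_000_000.0;
    }

    /** One line per stage: sample count, average, and 99th percentile. */
    synchronized String report() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<Long>> e : samples.entrySet()) {
            out.append(String.format(Locale.ROOT, "%s n=%d avg=%.3fms p99=%.3fms%n",
                    e.getKey(), e.getValue().size(),
                    averageMillis(e.getKey()), percentileMillis(e.getKey(), 99)));
        }
        return out.toString();
    }
}
```

A call such as `timer.time("generate-response", () -> handler.handle(request))` (where `handler` and `request` stand in for your own code) wraps one stage. Reporting a high percentile next to the average helps because a jump from 2 ms to 20 to 60 ms may be driven by a minority of slow requests that an average alone would blur.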

If you don't want to modify your code, look for places where you can observe the traffic without interfering with it (e.g. define a servlet filter in your web.xml).
