Diagnosing pathological behavior of cluster software

Posted on 2024-09-13 10:25:18

I'm using a kind of load balancer over a small cluster that is able to achieve >2000 rps on zero-duration requests (i.e. ones that are immediately satisfied by the worker nodes).
But as soon as the requests stop being zero-duration and start taking even 1 ms, performance immediately drops by more than 10x. The data being transferred in both directions is identical and is about 2 KB in size.
This is certainly not related to saturation of the cluster or of network throughput, because 200 rps of 1 ms requests is a tiny load and the network is 10 Gbit. Besides, the CPU load is only about 2-5% on both the load balancer and the worker nodes.
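As a rough sanity check on those figures, the arithmetic can be sketched in Python. The function names are illustrative only, and the calculation assumes a payload of exactly 2 KiB per direction and counts only the 1 ms of worker time per request, so it is a back-of-the-envelope estimate rather than a measurement.

```python
# Back-of-the-envelope check of the load described above.
# All names here are illustrative; the inputs come from the question.

RPS_SLOW = 200             # observed throughput with 1 ms requests
SERVICE_TIME_S = 0.001     # time each request takes on a worker
PAYLOAD_BYTES = 2 * 1024   # assumed: about 2 KB in each direction
LINK_BITS_PER_S = 10e9     # 10 Gbit network


def busy_fraction(rps, service_time_s):
    """Worker-seconds of real work demanded per wall-clock second."""
    return rps * service_time_s


def link_utilization(rps, payload_bytes, link_bits_per_s):
    """Fraction of the link used in one direction."""
    return rps * payload_bytes * 8 / link_bits_per_s


def serial_time_per_request(rps):
    """Seconds per request if requests were handled strictly one at a time."""
    return 1.0 / rps


if __name__ == "__main__":
    print("busy fraction:", busy_fraction(RPS_SLOW, SERVICE_TIME_S))
    print("link utilization:", link_utilization(RPS_SLOW, PAYLOAD_BYTES, LINK_BITS_PER_S))
    print("serial time per request (s):", serial_time_per_request(RPS_SLOW))
```

If the system were handling requests one at a time, 200 rps would correspond to about 5 ms per request, which is several times the 1 ms of actual work. That gap points at waiting somewhere in the pipeline rather than at CPU or bandwidth limits.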

I wonder whether this might be related to some pathological behavior of the OS scheduler or the OS network stack (i.e. some special-case behavior for very short interactions).

How might I diagnose the reason? Which performance counters should I watch, and which tools or methodologies should I use?

(Just in case someone simply knows the answer to my particular problem, I'm talking about the MS HPC Server 2008 R2's "WCF Broker", running on Windows Server 2008 R2 over Hyper-V)

Comments (3)

烂人 2024-09-20 10:25:18

One thing you can do is use ETW tracing to understand what the nodes are doing while your WCF job is running. On HPC Server, I sometimes run xperf through clusrun to collect traces on all nodes or on specific ones. There are a number of tools you can use to analyze ETW traces, including xperf itself. I haven't done any serious work with HPC SOA (WCF), but I did write a simple WCF raytracer app and then used xperf to profile it on several of the nodes.

贩梦商人 2024-09-20 10:25:18

It turned out to be an issue entirely unrelated to the network, caused by peculiarities of the scheduling mechanism of HPC Server. I resolved it by setting the configuration option "serviceRequestPrefetchCount" to 0 in the loadBalancing section of the WCF service config file.
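For anyone who wants to script that change, here is a minimal Python sketch. It assumes the option is stored as an attribute on an XML element named `loadBalancing`; the enclosing `hypotheticalBrokerSection` element in the sample is invented purely for illustration, so check the real layout of your service config file before relying on this.

```python
import xml.etree.ElementTree as ET


def set_prefetch_count(config_xml, value=0):
    """Return config_xml with serviceRequestPrefetchCount set on every
    <loadBalancing> element (assumed to hold the option as an attribute)."""
    root = ET.fromstring(config_xml)
    changed = 0
    for node in root.iter("loadBalancing"):
        node.set("serviceRequestPrefetchCount", str(value))
        changed += 1
    if changed == 0:
        raise ValueError("no <loadBalancing> element found")
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: only the loadBalancing name and the option name
# come from the answer above; the surrounding structure is made up.
SAMPLE = """<configuration>
  <hypotheticalBrokerSection>
    <loadBalancing serviceRequestPrefetchCount="1" />
  </hypotheticalBrokerSection>
</configuration>"""

if __name__ == "__main__":
    print(set_prefetch_count(SAMPLE))
```

The function works on a string rather than a file path so that the result can be inspected before anything is written back to disk.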

赠佳期 2024-09-20 10:25:18

I'm assuming that there are some shared resources with some kind of locking system in place? Is locking a bottleneck? It's hard to guess without seeing the system.

Do you have a way to profile the workers? What are they spending most of their time on, especially in the fast vs slow scenarios?
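One simple way to start comparing the fast and slow scenarios is to record per-request latency from the client side. The sketch below is a generic timing harness, not anything specific to HPC Server: `call` stands for a hypothetical function that sends one request and waits for its reply.

```python
import statistics
import time


def measure_latencies(call, n=200):
    """Invoke call() n times sequentially and return each duration in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Median, approximate 95th percentile, and maximum of the samples."""
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in for a real request: sleep for roughly 1 ms.
    print(summarize(measure_latencies(lambda: time.sleep(0.001), n=50)))
```

Running it once against zero-duration requests and once against 1 ms requests shows whether the extra time is spread evenly across requests or concentrated in a few long stalls, which helps decide where to look next.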
