How can I automatically detect machines on a LAN that are not busy?

Posted 2024-12-06 04:17:40

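I'm writing an MPI program that runs on a LAN. Any student can ssh into these machines at any time.

Although I always test my program at night, its performance is very inconsistent. My guess is that some nodes are busy when I run it.

So my question is: can I write a script that detects the machines that aren't busy and updates the machine file? Is there a simple way to do this?

Thanks a lot.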


旧梦荧光笔 2024-12-13 04:17:40

SSH into each machine and read /proc/loadavg, or determine its "busyness" some other way.
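The /proc/loadavg approach can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes passwordless (key-based) ssh to every node, and the function names and the `max_load=0.5` cutoff are hypothetical choices made for illustration.

```python
import subprocess


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def remote_loadavg(host, timeout=5):
    """Read /proc/loadavg on `host` over ssh; return None if that fails."""
    command = [
        "ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}",
        host, "cat", "/proc/loadavg",
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout + 5
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_loadavg(result.stdout)


def idle_hosts(hosts, max_load=0.5, probe=remote_loadavg):
    """Keep hosts whose 1-minute load is below max_load (an assumed cutoff)."""
    idle = []
    for host in hosts:
        load = probe(host)
        if load is not None and load[0] < max_load:
            idle.append(host)
    return idle
```

Calling `idle_hosts(["node01", "node02"])` (hypothetical host names) would return the subset that answered and reported a low 1-minute load; writing that list to the machine file, one host per line, is then a few more lines.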


哆兒滾 2024-12-13 04:17:40

I think the easiest way is to install the Nagios check_load[1] plugin on every node you want to check and call it via ssh with some sensible parameters:

# /usr/lib64/nagios/plugins/check_load -w 1,2,3 -c 3,4,5
OK - load average: 0.20, 0.43, 0.50|load1=0.200;1.000;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.1,2,3 -c 3,4,5
WARNING - load average: 0.18, 0.43, 0.50|load1=0.180;0.100;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.01,2,3 -c 0.1,4,5
CRITICAL - load average: 0.41, 0.46, 0.51|load1=0.410;0.010;0.100;0; load5=0.460;2.000;4.000;0; load15=0.510;3.000;5.000;0;

CRITICAL would mean "really busy", WARNING could be "is kinda busy" and OK would mean "the machine is idle".
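A script can pick up these three states from the plugin's exit code, since Nagios plugins conventionally exit with 0 for OK, 1 for WARNING and 2 for CRITICAL, and ssh passes the remote command's exit status through. The following is a rough sketch under those assumptions; the helper names are hypothetical, the plugin path is the one shown in the commands above, and the one-host-per-line machine file layout is assumed to suit your MPI launcher.

```python
import subprocess

# Plugin path as shown in the commands above; it may differ on other systems.
CHECK_LOAD = "/usr/lib64/nagios/plugins/check_load"

# Conventional Nagios plugin exit codes.
STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def classify(exit_code):
    """Map an exit code to a status name; anything unexpected is UNKNOWN."""
    return STATUS.get(exit_code, "UNKNOWN")


def check_host(host, warn="1,2,3", crit="3,4,5"):
    """Run check_load on `host` over ssh and return its status name."""
    command = ["ssh", "-o", "BatchMode=yes", host,
               CHECK_LOAD, "-w", warn, "-c", crit]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=15
        )
    except (subprocess.TimeoutExpired, OSError):
        return "UNKNOWN"
    return classify(result.returncode)


def write_machinefile(path, statuses):
    """Write every host whose status is OK to `path`, one per line."""
    idle = [host for host, status in statuses.items() if status == "OK"]
    with open(path, "w") as handle:
        for host in idle:
            handle.write(host + "\n")
    return idle
```

A typical run would build `statuses = {host: check_host(host) for host in hosts}` and pass it to `write_machinefile` just before launching the MPI job. An ssh failure (exit status 255) falls into UNKNOWN, so unreachable nodes are left out as well.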

Pay attention to the warning and critical thresholds, which are each given as three values for the 1/5/15-minute load averages. For instance, a load of 3 is perfectly fine on a 16-core machine, while a load of 3 on a single-core machine means it is really busy.
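One way to account for the core count is to divide the load by the number of cores, or equivalently to scale the thresholds up per machine. A small sketch of that arithmetic follows; the function names and the 0.7 per-core limit are hypothetical, and the core count is assumed to be known already (for example by running `nproc` on each node).

```python
def per_core_load(loadavg, cores):
    """Divide each load average by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(value / cores for value in loadavg)


def is_busy(loadavg, cores, limit=0.7):
    """True if the 1-minute load per core exceeds `limit` (an assumed cutoff)."""
    return per_core_load(loadavg, cores)[0] > limit


def threshold_arg(cores, per_core):
    """Build a "load1,load5,load15" threshold string scaled by core count."""
    return ",".join(f"{cores * value:g}" for value in per_core)
```

With these definitions, a load of 3 counts as idle on 16 cores (about 0.19 per core) and as busy on a single core, which matches the example above. `threshold_arg(4, (0.5, 0.75, 1.0))` yields `"2,3,4"`, a string that could be passed to `-w` or `-c`.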

Good luck!
Alex.

[1] http://nagiosplugins.org/man/check_load
