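prometheus-rules: how can I check whether every machine exposes a specific metric?
I use dcgm-exporter for monitoring and want to check whether the GPU on each machine is healthy.
At first I checked whether the target is UP, but machines with a faulty GPU still report UP.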
groups:
  - name: dcgm
    rules:
      - alert: dcgm_down
        expr: up{job="gpu_worker"} == 0
        for: 5m
        labels:
          severity: 1
          team: node_down
        annotations:
          summary: "Host GPU error!"
          value: 'Alert value: {{ $value }}'
          description: "{{ $labels.instance }}: GPU error detected on host, please check!"
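The workaround I came up with is to watch one specific metric: use count to get the number of series and subtract it from the actual number of GPU machines. This is not a good approach, because when the GPU on one machine fails, the alert does not say which machine it is, so every machine has to be checked by hand.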
expr: 40 - count(DCGM_FI_DEV_MEM_CLOCK) != 0
Currently the alert only reports how many machines are faulty; it does not show the IP of the faulty machine.
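The reason the alert cannot name a machine is that count() without a grouping clause aggregates away every label, including instance. A minimal sketch of the grouped form, assuming the scraped series carry the usual instance label that Prometheus attaches to each target:

```yaml
# Number of DCGM_FI_DEV_MEM_CLOCK series per scrape target.
# A host that exposes no such series does not appear in the result at all,
# so this alone still cannot produce an alert for the missing host.
expr: count by (instance) (DCGM_FI_DEV_MEM_CLOCK)
```

Grouping keeps the instance label on hosts that report the metric, but an absent series yields no result rather than a zero, so the missing host has to be derived from some other series that does exist for it.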
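Is there a way to check whether each machine exposes the specified metric, or to see the IP of the faulty machine?
Metrics exposed by a healthy GPU machine: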
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 5005
DCGM_FI_DEV_GPU_TEMP{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 40
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 42
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 2
DCGM_FI_DEV_ENC_UTIL{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 31
DCGM_FI_DEV_POWER_VIOLATION{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
DCGM_FI_DEV_SYNC_BOOST_VIOLATION{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
DCGM_FI_DEV_FB_FREE{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 11176
DCGM_FI_DEV_FB_USED{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", UUID="GPU-c38c0a66-4a28-634a-efe6-3021ccdb6d21", device="nvidia0"} 0
Metrics exposed by a faulty GPU machine:
DCGM_FI_DEV_XID_ERRORS{gpu="0", UUID="GPU-13f86750-22fd-89d9-aa3c-749442184ce5", device="nvidia0"} 45
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", UUID="GPU-13f86750-22fd-89d9-aa3c-749442184ce5", device="nvidia0"} 0
Comments (1)
I found a way to solve it. The expr expression is:
Reference: prometheus - alerting missing metric for many hosts in alertmanager - Stack Overflow
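The expression itself is not included in the reply. As a hedged sketch of one common way to alert on a metric that is missing for individual hosts, the rule below starts from the up series (which exists for every scrape target and carries the instance label) and removes the instances that do report DCGM_FI_DEV_MEM_CLOCK. It assumes the gpu_worker job and the labels from the question; the alert name dcgm_metric_missing and the annotation texts are hypothetical, and this is not necessarily the expression the commenter used.

```yaml
groups:
  - name: dcgm
    rules:
      - alert: dcgm_metric_missing   # hypothetical alert name
        # Targets that are scraped successfully (up == 1) but have no
        # DCGM_FI_DEV_MEM_CLOCK series for the same instance.
        # "unless on (instance)" drops every left-hand series whose
        # instance label also appears on the right-hand side.
        expr: up{job="gpu_worker"} == 1 unless on (instance) DCGM_FI_DEV_MEM_CLOCK
        for: 5m
        labels:
          severity: 1
          team: node_down
        annotations:
          summary: "GPU metrics missing"
          description: "{{ $labels.instance }} is up but reports no DCGM_FI_DEV_MEM_CLOCK series, please check the GPU."
```

Because the result is built from up, each firing alert keeps the instance label, so the description shows the address of the affected machine instead of only a count.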