Prometheus sends resolved notifications when metric data is missing
We use Prometheus Alertmanager for alerting. We frequently lose metrics because of connection problems.
When metrics go missing, Prometheus clears the alerts and sends resolved notifications. A few minutes later the connection problem is fixed and the alerts fire again.
Is there any way to stop the resolved notifications when metric data is missing?
For example: when a node is down, the other alerts for that node (CPU and disk usage checks) get resolved.
Relevant values from the Alertmanager and Prometheus configs:
repeat_interval: 1d
resolve_timeout: 15m
group_wait: 1m30s
group_interval: 5m
scrape_interval: 1m
scrape_timeout: 1m
evaluation_interval: 30s
NodeDown alert:
- alert: NodeDown
  expr: up == 0
  for: 30s
  labels:
    severity: critical
    alert_group: host
  annotations:
    summary: "Node is down: instance {{ $labels.instance }}"
    description: "Can't reach node_exporter at {{ $labels.instance }}. Probably the host is down."
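For context, a disk usage check of the kind mentioned above might look like the sketch below; the alert name, threshold, and mountpoint filter are assumptions, not taken from the question. Once node_exporter stops being scraped, the expression returns no samples, so an alert like this resolves even though nothing was fixed.

# Hypothetical dependent alert; it resolves as soon as its metrics disappear.
- alert: DiskUsageHigh
  expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 5m
  labels:
    severity: warning
    alert_group: host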
2 Answers
Alertmanager can inhibit (= automatically silence) alerts under certain conditions. You will see neither firing nor resolved notifications for an inhibited alert until the inhibiting condition becomes false again. Here is an example of one such rule:
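(A minimal sketch: the warning severity on the dependent alerts and the shared instance label are assumptions based on the labels used in the question.)

inhibit_rules:
  - source_match:
      alertname: NodeDown
      severity: critical
    target_match:
      severity: warning
    # Only inhibit alerts that come from the same instance as the firing NodeDown alert.
    equal: ['instance']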
To summarize, the rule above automatically silences all warning alerts for a given machine while its metric source is down. The Alertmanager documentation on inhibition covers the subject in more detail.
Did you consider using the last_over_time function? Like this:
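(A sketch of the idea applied to a node_exporter disk usage check; the alert name, threshold, and 10m lookback are assumptions. last_over_time() returns the most recent sample found in the window, so the alert keeps evaluating against the last known value for up to 10 minutes of missing scrapes instead of resolving as soon as the series disappears.)

- alert: DiskUsageHigh
  # Keep using the last sample seen within the past 10 minutes,
  # so a few missed scrapes do not resolve the alert.
  expr: >
    last_over_time(node_filesystem_avail_bytes{mountpoint="/"}[10m])
    / last_over_time(node_filesystem_size_bytes{mountpoint="/"}[10m]) < 0.10
  for: 5m
  labels:
    severity: warning
    alert_group: host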