Prometheus sends resolved notifications when metric data is missing

Published 2025-01-13 01:26:06

We use Prometheus Alertmanager for alerting. We frequently lose metrics because of intermittent connection problems.

So, when metrics are missing, Prometheus clears the alerts and sends resolved notifications. A few minutes later, the connection problem is fixed and the alerts fire again.

Is there any way to stop the resolved notifications when metric data is missing?

For example, when a node is down, the other alerts for that node (CPU, disk usage checks) get resolved.

Relevant values from the Alertmanager and Prometheus configuration:

  repeat_interval: 1d
  resolve_timeout: 15m

  group_wait: 1m30s
  group_interval: 5m

  scrape_interval: 1m
  scrape_timeout: 1m 
  evaluation_interval: 30s

NodeDown alert:

  - alert: NodeDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
      alert_group: host
    annotations:
      summary: "Node is down: instance {{ $labels.instance }}"
      description: "Can't reach node_exporter at {{ $labels.instance }}. Probably the host is down."
        

Comments (2)

沦落红尘 2025-01-20 01:26:06

Alertmanager can inhibit (= automatically silence) alerts under certain conditions. Inhibited alerts will neither fire nor resolve until the inhibiting condition becomes false again. Here is an example of one such rule:

inhibit_rules:
- # Mute alerts with the "severity" label equal to "warning" ...
  target_matchers:
  - severity = warning

  # ... when an alert named "ExporterDown" is firing ...
  source_matchers:
  - alertname = ExporterDown

  # ... if both the muted and the firing alerts have exactly the same "job" and "instance" labels.
  equal: [instance, job]

To summarize, the rule above automatically silences all warning alerts for a given machine when the metric source is down. The Alertmanager documentation on inhibition rules covers the subject in more detail.
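
Adapted to the NodeDown alert from the question, a minimal sketch could look like the following. This assumes that the host-level alerts to be muted carry a severity of warning and share the instance label with the NodeDown alert; the matchers are illustrative and should be adjusted to your own labels:

inhibit_rules:
- # When "NodeDown" is firing for an instance ...
  source_matchers:
  - alertname = NodeDown

  # ... mute the warning-level alerts (CPU, disk usage, ...) ...
  target_matchers:
  - severity = warning

  # ... as long as both alerts have exactly the same "instance" label.
  equal: [instance]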

千紇 2025-01-20 01:26:06

Did you consider using the last_over_time function? Like this:

last_over_time(up[2h]) == 0
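
Wired into the NodeDown rule from the question, this could look like the sketch below. The 2h lookback window is an arbitrary choice and controls how long the last known sample is reused when the series has gaps:

  - alert: NodeDown
    # last_over_time() returns the most recent sample within the window,
    # so the expression still yields a value during short gaps in the series
    # instead of the alert resolving as soon as data goes missing.
    expr: last_over_time(up[2h]) == 0
    for: 30s
    labels:
      severity: critical
      alert_group: host

The same pattern can be applied to the resource-usage expressions (CPU, disk) so that those alerts also keep firing across short data gaps instead of resolving.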