An expression in Prometheus to detect constant growth over time
We get metrics about the number of messages in queues from our ActiveMQ Artemis 2.10.0 instance into Prometheus, and I need to be notified when, for a certain amount of time (let's say 8 hours), a queue grows and does not decrease (usually this indicates a problem with the service that pulls messages from the queues).
Like this:
But if I see something like the image below, i.e. peak growth followed by a decrease, then the alert should not be triggered:
Right now I use this expression, but sometimes it does not work correctly due to large growth spurts, even when they are followed by a decrease:
floor((predict_linear(artemis_message_count{job="activemq",queue=~".*"}[24h], 3600 * 24 * 1))) - max_over_time(artemis_message_count{job="activemq",queue=~".*"}[24h]) > 0
I can't figure out which expression would be better to use in order to get fewer false alerts. I would be grateful for a hint.
Comments (1)
There's a neat trick to check a Prometheus metric for constant growth, decline, or any other condition over a certain time interval.
Below we measure the fraction of time during which TARGET_METRIC > 0 over the last 8 hours, at a 5-minute resolution.
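A minimal sketch of such an expression, assuming PromQL subquery syntax (available since Prometheus 2.7) and TARGET_METRIC as a placeholder metric name:

avg_over_time((TARGET_METRIC > bool 0)[8h:5m])

Here [8h:5m] is a subquery that evaluates the bracketed condition every 5 minutes over the last 8 hours.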
There's a bool operator modifier which returns 1 if the condition is met and 0 otherwise. Hence, avg_over_time calculates the average of these 1's and 0's. If the condition was never met over this 8-hour interval, we get an exact 0 as the result. If the condition was always met, we get an exact 1. And we get everything in between if the condition was met only some of the time.
Of course, it could be any condition and any metric, including functions (!).
Now back to your example.
We need to check whether the metric grows constantly over time. Let's use the deriv function to check whether the metric increases, decreases, or keeps the same value; a sketch of the resulting condition follows.
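A minimal sketch of that condition, assuming the same avg_over_time/bool subquery trick as above and an assumed 5m range for deriv:

avg_over_time((deriv(artemis_message_count{job="activemq",queue=~".*"}[5m]) > bool 0)[8h:5m]) == 1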
Here we are. The condition above checks whether artemis_message_count{job="activemq",queue=~".*"} increased over the last 8 hours, at a 5-minute resolution. If some deviation is acceptable, then we replace the "1" with "0.95" or something close to it.
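For example, a relaxed variant of the condition that tolerates brief dips might look like this (using >= rather than == so that fractions between 0.95 and 1 also match; the 0.95 threshold is the illustrative value mentioned above):

avg_over_time((deriv(artemis_message_count{job="activemq",queue=~".*"}[5m]) > bool 0)[8h:5m]) >= 0.95

This expression can then be used directly as the expr of a Prometheus alerting rule.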