Is it possible to get accurate per-minute metrics from Prometheus?
Goal
Track RPM (requests per minute) and uptime via Grafana and Prometheus
Situation
We are using
django-prometheus -> To emit metrics
fluent-bit -> Scrapes django metrics every 15s and pushes to prometheus
prometheus -> 2 shards running via prometheus operator on k8s
Problem
When we compare the Grafana dashboard with the AWS target group request metrics, the numbers don't match.
We tried all of the options below:
Expr: sum by(service) (irate(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
Expr: sum by(service) (increase(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
Expr: sum by(service) (rate(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
`django_http_requests_before_middlewares_total` is a Counter.
This counter never resets because we have unique dimensions:
- container_id
- service_name
- namespace
Q. Is it possible to create a dashboard in Grafana that resembles the AWS target group metrics?
Ideally `increase` should work, but it takes the diff continuously, and that might be giving incorrect results.
Thanks in advance.
Answers (2)
In theory the following query should return the exact number of per-service requests for the last minute:
But in practice Prometheus may return unexpected results for this query:
- … (`[1m]` in the query above) and the first raw sample in the lookbehind window.
- `increase(m[d])` would return empty results for `d <= 1m`.
Prometheus developers are aware of these issues and are going to fix them - see this design doc.
In the meantime, you can try using the `increase()` function in VictoriaMetrics, a Prometheus-like monitoring solution I work on. Its `increase` function is free from the issues mentioned above.
An important note: both Prometheus and VictoriaMetrics calculate query results independently for each point displayed on the graph. So, if you need to display the per-minute number of requests using the query above, you need to set the interval between points on the graph (aka `step`) to one minute.
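The note about `step` can be sketched as a range query against the Prometheus HTTP API (`/api/v1/query_range`), where `step` is an explicit parameter. This is a minimal sketch, not a definitive setup: the server address is hypothetical, and the query is an assumption that reuses the asker's `increase` expression with a `[1m]` window.

```python
from urllib.parse import urlencode

# Assumptions for illustration: a hypothetical server address, and the
# asker's expression rewritten with a 1m lookbehind window.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = (
    "sum by(service) (increase("
    'django_http_requests_before_middlewares_total{namespace="name"}[1m]))'
)


def build_range_query_url(base_url, query, start, end, step_seconds=60):
    """Build a query_range URL whose step equals the 1m lookbehind window."""
    params = {
        "query": query,
        "start": start,  # Unix timestamp in seconds
        "end": end,      # Unix timestamp in seconds
        "step": f"{step_seconds}s",
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def evaluation_points(start, end, step_seconds=60):
    """Count the timestamps at which the query is evaluated (both ends included)."""
    return (end - start) // step_seconds + 1


# A 10-minute range with a 60s step is evaluated at 11 timestamps,
# each one computed independently over its own 1m window.
print(build_range_query_url(PROMETHEUS_URL, QUERY, 0, 600))
print(evaluation_points(0, 600))
```

With the step equal to the window, consecutive points cover adjacent, non-overlapping minutes. In a Grafana panel the corresponding knob is the query's minimum step or interval option, though the exact label depends on the Grafana version.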
tl;dr - no, Prometheus does not keep enough data to give perfectly precise values.
To see why, let's assume that 1 minute ago Prometheus scraped a value of `10` for the metric `http_requests`, and just now it has been updated to `40`. It's already clear that with `1m` sampling you don't know exactly when during the last minute these 30 requests happened. Was it a short spike, or were they distributed evenly? Regardless, `rate(http_requests[1m])` will give you `(40-10)/60s = 0.5` requests per second. `increase()` works in the same fashion: it's `rate() * interval`, or `0.5 * 60 = 30`.
Although the example above shows precise values, it should be clear that you won't be able to achieve perfect precision with this math. The error is generally insignificant unless you are dealing with slow-moving counters (which update once every several minutes).
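The arithmetic in this answer can be reproduced with a small Python sketch. The function names are hypothetical, and the model is deliberately simplified to two samples: it ignores counter resets and whatever boundary handling Prometheus applies to real windows.

```python
def simple_rate(prev_value, curr_value, interval_seconds):
    """Average per-second rate between two counter samples."""
    return (curr_value - prev_value) / interval_seconds


def simple_increase(prev_value, curr_value, interval_seconds):
    """Increase over the interval, expressed as rate * interval."""
    return simple_rate(prev_value, curr_value, interval_seconds) * interval_seconds


# The counter read 10 one minute ago and reads 40 now.
print(simple_rate(10, 40, 60))      # 0.5 requests per second
print(simple_increase(10, 40, 60))  # 30.0 requests in the last minute
```

Only the two endpoint samples enter the calculation, so a short spike and an even spread of the same 30 requests give identical results.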