Understanding the CPU Busy Prometheus query
I am new to Grafana and Prometheus. I have read a lot of documentation, and now I'm trying to work backwards by reviewing some existing queries and making sure I understand them.
I have downloaded the Node Exporter Full dashboard (https://grafana.com/grafana/dashboards/1860). I have been reviewing the CPU Busy query and I'm a bit confused. I am quoting it below, spaced out so we can see the nested sections better:
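For reference, a query of the shape described in this thread looks roughly like the sketch below. It is a reconstruction from the walkthrough and the answer, not a verbatim copy: the label matchers ($node and $job, assumed to be Grafana dashboard variables), the $__rate_interval window, and the line layout are all assumptions, so the line numbers in the walkthrough may not line up with it exactly.

```promql
(
  (
    # Number of CPU cores: one series per cpu label, then count those series.
    count(
      count(node_cpu_seconds_total{instance="$node", job="$job"}) by (cpu)
    )
    -
    # Idle seconds per second, summed over all cores.
    avg(
      sum by (mode) (
        rate(node_cpu_seconds_total{mode="idle", instance="$node", job="$job"}[$__rate_interval])
      )
    )
  )
  * 100
)
/
# Number of CPU cores again, as the denominator.
count(
  count(node_cpu_seconds_total{instance="$node", job="$job"}) by (cpu)
)
```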
In this query, job is node-exporter, while instance is the IP and port of the server. This is my basic understanding of the query:
node_cpu_seconds_total is a counter of the number of seconds the CPU took at a given sample.
1. Line 5: Get CPU seconds at a given instant, broken down by the individual CPU cores
2. Line 4: Add up all CPU seconds across all cores
3. Line 3: Why is there an additional count()? Does it do anything?
4. Line 12: Range vector - get the CPU seconds during which the CPU was idle over the given rate period
5. Line 11: Take a rate to turn that into the rate of change of CPU seconds (and return an instant vector)
6. Line 10: Sum up all rates, broken down by CPU mode
7. Line 9: Take the single average rate across all CPU mode rates
8. Line 8: Subtract the average rate of change (Line 9) from total CPU seconds (Line 3)
9. Line 16: Multiply by 100 to convert minutes to seconds
10. Lines 18-20: Divide Line 19 by the count of the count of all CPU seconds across all CPUs
My questions are as follows:
- I would have thought that CPU usage would simply be (all non-idle CPU usage) / (total CPU usage). I therefore don't understand why rate is taken into account at all (#6 and #8).
- The numerator here seems to be trying to get all non-idle usage, and does so by getting the full sum and subtracting the idle time. But why does one use count and the other sum?
- If we grab CPU seconds by filtering by mode=idle, then does adding the by (mode) add anything? There is only one mode anyway. My understanding is that by (something) is more relevant when there are multiple values and we group the values by that category (as we do by cpu in this query).
- Lastly, as asked under Line 3 above, what is with the double count() in the numerator and denominator?
Comments (1)
Both of these count functions return the number of CPU cores. If you take them out of this long query and execute them on their own, it immediately makes sense:
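A sketch of the inner expression run on its own. The instance value is a placeholder assumption; substitute your own target:

```promql
# One result series per core. Each value is the number of
# node_cpu_seconds_total series (one per CPU mode) carrying that cpu label.
count(node_cpu_seconds_total{instance="192.0.2.10:9100", job="node-exporter"}) by (cpu)
```

On a machine with two cores, this returns two series, one with cpu="0" and one with cpu="1".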
By putting the above into another count() function, you will get a value of 2, because there are just 2 series in the result. At this point, both count(count(...)) expressions in the original query can be replaced with that number.
The rest, however, is somewhat complicated. This:
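Judging from the description that follows, the expression in question is the idle-rate aggregation. A sketch, with the instance value and the 1m window as placeholder assumptions:

```promql
# rate() yields idle seconds per second for each core;
# sum by (mode) adds them up, leaving a single series labelled mode="idle".
sum by (mode) (
  rate(node_cpu_seconds_total{mode="idle", instance="192.0.2.10:9100", job="node-exporter"}[1m])
)
```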
... is essentially the sum of the idle time of all CPU cores (I'm intentionally skipping the context of time to make it simpler). It's not clear why there is by (mode), since the rate function inside has a filter that allows only the idle mode to appear. With or without by (mode), it returns just one value. Applying avg() on top of that makes no sense at all. I assume that the intention was to get the amount of idle time per CPU (by (cpu), that is). In that case it starts to make sense, although it is still unnecessarily complex. Thus, at this point we can simplify the query to this:
I don't know why it is so complicated; you can get the same result with a simple query like this:
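One algebraically equivalent form, offered as a sketch rather than as the exact query from the answer. With N cores, (N - total idle rate) * 100 / N equals 100 * (1 - total idle rate / N), and total idle rate / N is simply the average idle rate per core. The instance value and the 1m window are placeholder assumptions:

```promql
# rate() gives each core's idle seconds per second (a fraction between 0 and 1).
# avg() averages that across cores; 1 minus the average is the busy fraction.
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="192.0.2.10:9100", job="node-exporter"}[1m])))
```

This also shows why rate is needed at all: node_cpu_seconds_total is a cumulative counter, so only its change over a time window says how busy the CPU was during that window.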