Understanding the CPU Busy Prometheus query

Posted on 2025-01-15 03:16:16

I am new to Grafana and Prometheus. I have read a lot of documentation, and now I'm trying to work backwards by reviewing some existing queries and making sure I understand them.

I have downloaded the Node Exporter Full dashboard (https://grafana.com/grafana/dashboards/1860). I have been reviewing the CPU Busy query and I'm a bit confused. I am quoting it below, spaced out so we can see the nested sections better:

[Image: the CPU Busy query, spaced out across numbered lines; not reproduced here]

In this query, job is node-exporter while instance is the IP and port of the server. This is my base understanding of the query:
node_cpu_seconds_total is a counter of the number of seconds the CPU took at a given sample.

  1. Line 5: Get cpu seconds at a given instant, broken down by the individual CPU cores
  2. Line 4: Add up all CPU seconds across all cores
  3. Line 3: Why is there an additional count()? Does it do anything?
  4. Line 12: Rate vector - get cpu seconds of when the cpu was idle over the given rate period
  5. Line 11: Take a rate to transfer that into the rate of change of cpu seconds (and return an instant vector)
  6. Line 10: Sum up all rates, broken down by CPU modes
  7. Line 9: Take the single average rate across all CPU mode rates
  8. Line 8: Subtract the average rate of change (Line 9) from total CPU seconds (Line 3)
  9. Line 16: Multiply by 100 to convert minutes to seconds
  10. Lines 18-20: Divide Line 19 by the count of the count of all CPU seconds across all CPUs

My questions are as follows:

  • I would have thought that CPU usage would simply be (all non-idle CPU usage) / (total CPU usage). I therefore don't understand why rate is taken into account at all (#6 and #8).
  • The numerator here seems to be trying to get all non-idle usage, and does so by getting the full sum and subtracting the idle time. But why does one side use count and the other sum?
  • If we grab CPU seconds by filtering on mode=idle, does adding by (mode) add anything? There is only one mode anyway. My understanding is that by (something) is more relevant when there are multiple values and we group the values by that category (as we do by cpu in this query).
  • Lastly, as asked in item 3 above, what is with the double count(), in the numerator and denominator?

Comments (1)

妞丶爷亲个 2025-01-22 03:16:16

Both of these count functions return the number of CPU cores. If you take them out of this long query and execute them on their own, it immediately makes sense:

count by (cpu) (node_cpu_seconds_total{instance="foo:9100"})

# result:
{cpu="0"} 8
{cpu="1"} 8

Wrapping the above in another count() gives a value of 2, because the result contains just 2 series, one per core. At this point, we can simplify the original query to this:

(
  (
    NUM_CPU
    -
    avg(
      sum by(mode) (
        rate(node_cpu_seconds_total{mode="idle",instance="foo:9100"}[1m])
      )
    )
  )
  * 100
)
/ NUM_CPU
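
To see why the nested count yields the number of cores, here is a minimal Python sketch that mimics the two aggregations on a toy set of label sets. The 2-core host and the list of mode names are assumptions chosen to match the sample output above; this only imitates the grouping logic and is not how Prometheus itself is implemented.

```python
from collections import Counter

# Assumed data: a 2-core host where every core exposes 8 modes.
MODES = ["idle", "iowait", "irq", "nice", "softirq", "steal", "system", "user"]
series = [{"cpu": str(cpu), "mode": mode} for cpu in range(2) for mode in MODES]

# count by (cpu) (...): one output series per distinct cpu label,
# whose value is the number of input series in that group.
count_by_cpu = Counter(s["cpu"] for s in series)

# count(...) without grouping: the number of series in its input,
# which here is the number of distinct cpu labels.
num_cpu = len(count_by_cpu)

print(dict(count_by_cpu))  # {'0': 8, '1': 8}
print(num_cpu)             # 2
```

The inner count discards the counter values entirely; only the number of series per core matters, which is why the outer count ends up as NUM_CPU.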

The rest, however, is somewhat complicated. This:

    sum by(mode) (
      rate(node_cpu_seconds_total{mode="idle",instance="foo:9100"}[1m])
    )

... is essentially the sum of the idle time of all CPU cores (I'm intentionally skipping the context of time to make it simpler). It's not clear why by (mode) is there, since the selector inside rate already filters on mode="idle", so only the idle mode can appear. With or without by (mode), it returns just one value:

# with by (mode)
{mode="idle"} 0.99

# without
{} 0.99

avg() on top of that makes no sense at all, because it averages a single value. I assume the intention was to get the amount of idle time per CPU (by (cpu), that is). In that case it starts to make sense, although it is still unnecessarily complex. Thus, at this point we can simplify the query to this:

(NUM_CPU - IDLE_TIME_TOTAL) * 100 / NUM_CPU
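
The point about by (mode) and avg() can be checked with a small Python sketch. The per-core idle rates and the helper names `sum_by` and `avg` are hypothetical, written only to imitate the grouping behaviour of the PromQL aggregations.

```python
from collections import defaultdict

# Assumed per-core idle rates (seconds of idle time per second).
idle_rates = [
    ({"cpu": "0", "mode": "idle"}, 0.5),
    ({"cpu": "1", "mode": "idle"}, 0.25),
]

def sum_by(samples, label):
    """Imitates `sum by (label)`: add up values that share the label's value."""
    groups = defaultdict(float)
    for labels, value in samples:
        groups[labels[label]] += value
    return dict(groups)

def avg(values):
    """Imitates `avg` without grouping: mean over all input series."""
    values = list(values)
    return sum(values) / len(values)

by_mode = sum_by(idle_rates, "mode")  # a single group, because mode is always "idle"
by_cpu = sum_by(idle_rates, "cpu")    # one group per core

print(by_mode, avg(by_mode.values()))  # {'idle': 0.75} 0.75
print(by_cpu, avg(by_cpu.values()))    # {'0': 0.5, '1': 0.25} 0.375
```

With by (mode) the average is taken over one value and changes nothing, so the result is the total idle rate across cores. Grouping by cpu instead would give the mean idle fraction per core.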

I don't know why it is so complicated; you can get the same result with a simple query like this:

100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="foo:9100"}[1m])))
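
As a rough check that the two forms agree, and to show why a rate is involved at all, here is a Python sketch on made-up counter samples. The counter values, the 60-second window, and the simple (end - start) / window calculation are all assumptions; the real rate() function also handles counter resets and extrapolation, which this ignores.

```python
# Assumed idle counter samples taken 60 s apart on a 2-core host:
# core -> (value at window start, value at window end), in CPU seconds.
idle_counters = {"0": (1000.0, 1030.0), "1": (2000.0, 2015.0)}
WINDOW_SECONDS = 60.0

# The raw counters only ever grow, so on their own they say nothing about
# current load. A simplified stand-in for rate(): per-second increase.
idle_rate = {
    cpu: (end - start) / WINDOW_SECONDS
    for cpu, (start, end) in idle_counters.items()
}
num_cpu = len(idle_rate)
idle_total = sum(idle_rate.values())

# Dashboard-style form, with the subtraction grouped before the multiplication.
long_form = (num_cpu - idle_total) * 100 / num_cpu

# Short form: 100 * (1 - average idle fraction per core).
short_form = 100 * (1 - idle_total / num_cpu)

print(idle_rate)              # {'0': 0.5, '1': 0.25}
print(long_form, short_form)  # 62.5 62.5
```

Each core accumulates at most one CPU second per second, so the idle rate per core is a fraction between 0 and 1, and NUM_CPU plays the role of the total capacity in the long form.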