Pre-populating associative array keys in awk?

Posted on 2024-12-06 23:55:45

I've written a munin plugin that uses slurm's sacct to monitor job states on an HPC cluster. I've written it in sh + awk (rather than my usual tool of choice, perl).

The script works, but it took me ages to figure out how to pre-populate the associative array of possible states (some or most of them may not be present in the sacct output, and I want them to default to zero). Google wasn't much help, and the best I could come up with was to use split on a string to produce a temporary array, which I then iterated over.

I came up with this:

BEGIN {
    num = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
    for (i=1;i<=num;i++) {
        states[statenames[i]] = 0
    }
}

This works, but seems clumsy compared to how I'd do it in perl, like this:

foreach (qw(cancelled completed completing failed nodefail pending running suspended timeout)) {
    $states{$_} = 0;
}

or this:

%states = map {$_ => 0} qw(cancelled completed completing failed nodefail pending running suspended timeout);

My question is: is there a way of doing this in awk that is similar to either of the perl versions?

[ edited ]

To clarify, here's a sample of the sacct output I'm piping into awk. Note that the only states in this output are RUNNING, COMPLETED, and CANCELLED. The others don't appear (because they haven't occurred today), but I want them in my script's output anyway (in a form usable by munin, as "statename.value 0").

# sacct -X -P -o 'state' -n
RUNNING
RUNNING
RUNNING
RUNNING
COMPLETED
RUNNING
COMPLETED
RUNNING
COMPLETED
COMPLETED
CANCELLED by 1000
COMPLETED

[ edited again ]

And here's sample output from my munin plugin:

# ./slurm-sacct
suspended.value 0
pending.value 0
nodefail.value 0
failed.value 0
running.value 6
completing.value 0
completed.value 5
timeout.value 0
cancelled.value 1

The script runs and does what I want; I just wanted to know if there was a better way to initialise the associative array.

百善笑为先 2024-12-13 23:55:45

You probably don't need to do it at all. Variables in awk are dynamic, which means they're automatically initialized when they are first used (either assigned to or accessed), and this applies to array elements as well.

A variable will be initialized to 0 if it's accessed in a numeric context, or to the empty string otherwise. (This isn't specific to gawk: POSIX awk defines an uninitialized value as having both the numeric value zero and the empty string as its string value.) So if you're doing something like counting the number of jobs that are in each state, the entire program is as simple as something like

{ states[$1]++ }
END {
     for (state in states) print state, states[state]
}

Each time the expression states[$1]++ is executed, it will check for the existence of states[$1] and initialize it to 0 if it doesn't already exist.
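To make the auto-initialization concrete, here is a small sketch. The array name `counts`, its keys, and the wrapper function `uninit_demo` are made up for illustration. It reads two elements that were never assigned, once as a string and once as a number, and then increments a third:

```shell
uninit_demo() {
    awk 'BEGIN {
        # None of these elements has been assigned yet.
        printf "as string: [%s]\n", counts["pending"]       # empty string
        printf "as number: [%d]\n", counts["running"] + 0   # numeric zero
        counts["failed"]++                                  # counting starts from 0
        printf "after ++:  [%d]\n", counts["failed"]
    }'
}
uninit_demo
```

One thing to watch: merely referencing `counts["pending"]` creates that element, so a later `for (k in counts)` will visit it. To test for a key without creating it, use `("pending" in counts)`.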


EDIT: From your comment I'm guessing you want to print out a line for each possible state, regardless of whether there are any jobs in that state or not. In that case, you need to include all the possible state names, and there is no shortcut notation for doing so as there is in Perl. As far as I know, what you've already found is about as clean as it gets. (Awk is not really designed with that usage in mind)

I'd suggest the following:

{ states[$1]++ }
END {
     split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
     for (i in statenames) print statenames[i], states[statenames[i]]+0
}
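
If the split-and-loop idiom is needed in more than one place, it could be wrapped in a small user-defined function. This is only a sketch: `zero_keys` and `init_demo` are hypothetical names, not something awk provides. The extra parameters after the gap (`n`, `i`, `tmp`) are the usual awk way of declaring local variables.

```shell
init_demo() {
    awk '
    # Hypothetical helper: set arr[k] = 0 for every whitespace-separated word k in list.
    # n, i and tmp are local scratch variables.
    function zero_keys(arr, list,    n, i, tmp) {
        n = split(list, tmp)
        for (i = 1; i <= n; i++) arr[tmp[i]] = 0
    }
    BEGIN {
        zero_keys(states, "cancelled completed completing failed nodefail pending running suspended timeout")
        # Report how many keys exist, whether "timeout" is one of them, and its value.
        n = 0
        for (s in states) n++
        print n, ("timeout" in states), states["timeout"]
    }'
}
init_demo
```

This does not make the call as terse as Perl's `map`, but it reduces the initialization in the main program to a single line.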

左岸枫 2024-12-13 23:55:45

Perhaps Craig can replace this:

print "Timeout states " states["timeout"] ".";

with this:

print "Timeout states " int(states["timeout"]) ".";

In my case, if there is no timeout state in the awk input, the first print will give:

Timeout states .

While the second will give:

Timeout states 0.
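
A detail worth keeping in mind with snippets like the ones above: inside the brackets, a bare word such as `timeout` is an (unset) awk variable, so it indexes the element with the empty-string key, while `"timeout"` in quotes is the literal key. A minimal sketch with made-up input (the function name `quoting_demo` is only for illustration):

```shell
quoting_demo() {
    printf 'TIMEOUT\nTIMEOUT\n' | awk '
        { states[tolower($1)]++ }
        END {
            # timeout (unquoted) is an unset variable, so this reads states[""]
            print "unquoted: " int(states[timeout])
            # "timeout" (quoted) is the string key that was actually counted
            print "quoted: " int(states["timeout"])
        }'
}
quoting_demo
```

Adding `+0` instead of calling `int()` forces the same numeric conversion, and is the form used in the other answers.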

入画浅相思 2024-12-13 23:55:45

I think a more natural approach in awk would be to have a separate file of keys. Consider a file keys.txt with one key per line. You could then do something like this:

printf "key1\nkey2\nkey2\nkey5" | 
  awk '
    FILENAME == "keys.txt" {
      counts[$0] = 0
      next
    }

    {
      counts[$0]++
    }

    END {
      for (key in counts) {
        print key, counts[key]
      }
    }' keys.txt -

With five keys in keys.txt, this produces:

key1 1
key2 2
key3 0
key4 0
key5 1

Although the keys are shown in order here, that's just incidental and shouldn't be relied upon.

For the specific example, you could also skip the associative array altogether. Instead, you could minimally process the lines with awk and use sort | uniq -c to tabulate the counts. The presence of all keys could be ensured using join against a file of keys.
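
A rough sketch of that pipeline variant, using the same made-up keys as above. The function name and the temporary files are assumptions for illustration only; `join` needs both inputs sorted on the join field.

```shell
count_with_join() {
    keys=$(mktemp) && counts=$(mktemp) || return 1
    # Hypothetical key list, sorted so that join can use it.
    printf '%s\n' key1 key2 key3 key4 key5 | sort > "$keys"
    # Tabulate the data, then swap the "count key" output of uniq -c into "key count".
    printf 'key1\nkey2\nkey2\nkey5\n' | sort | uniq -c | awk '{ print $2, $1 }' > "$counts"
    # -a 1 keeps keys that have no count; -e 0 fills the missing count with 0;
    # -o 0,2.2 prints the join field followed by the count column.
    join -a 1 -e 0 -o 0,2.2 "$keys" "$counts"
    rm -f "$keys" "$counts"
}
count_with_join
```

Unlike the `for (key in counts)` loop, this prints the keys in sorted order, because `join` walks the sorted key file.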

莫言歌 2024-12-13 23:55:45

awk is somewhat clumsier (I would say "less terse") than Perl.

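You could write this (similar to @Michael's answer). Note that the `<(...)` process substitution needs bash, ksh, or zsh rather than plain sh:

sacct -X -P -o 'state' -n |
awk '
  NR == FNR { states[$1] = 0; next }    # first input: the list of known states
  { states[tolower($1)]++ }             # remaining input (stdin): the sacct output
  END { for (s in states) print s ".value", states[s] }
' <(printf "%s\n" cancelled completed completing failed nodefail pending running suspended timeout) -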

我为君王 2024-12-13 23:55:45

One tweak to @DavidZaslavsky's answer might be to print the states in the order you specified them on the split() line. That would be:

{ states[tolower($1)]++ }
END {
     n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
     for (i=1; i<=n; i++) {
         state = statenames[i]
         print state, states[state]+0
     }
}

I also converted the input to lower case so it matches your hard-coded values, got rid of the unnecessary 3rd arg to split() and the subsequent null statement (trailing semi-colon).

In case you want to account for finding state names in your input that weren't in your hard-coded set, you could tweak it to:

{ states[tolower($1)]++ }
END {
     n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
     for (i=1; i<=n; i++) {
         state = statenames[i]
         print state, states[state]+0
         delete states[state]
     }
     for (state in states) {
         printf "WARNING: found new state name %s\n", state | "cat >&2"
         print state, states[state]+0
     }
}
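
Putting the ordered version together with the output format from the question, an end-to-end sketch might look like this. `sample_sacct` is a stand-in for the real `sacct` call and simply replays the sample lines from the question; `munin_values` is a made-up name.

```shell
sample_sacct() {
    # Stand-in for: sacct -X -P -o 'state' -n  (sample lines from the question)
    printf '%s\n' RUNNING RUNNING RUNNING RUNNING COMPLETED RUNNING \
        COMPLETED RUNNING COMPLETED COMPLETED 'CANCELLED by 1000' COMPLETED
}

munin_values() {
    awk '
    # $1 is the state name; any trailing words ("by 1000") are ignored.
    { states[tolower($1)]++ }
    END {
        n = split("cancelled completed completing failed nodefail pending running suspended timeout", statenames)
        # Walk the list by index so the output order matches the list.
        for (i = 1; i <= n; i++)
            print statenames[i] ".value", states[statenames[i]] + 0
    }'
}

sample_sacct | munin_values
```

With the sample input this reports 6 running, 5 completed, and 1 cancelled job, with every other state at 0.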