AWK: Access captured group from line pattern

Posted 2024-09-03 20:33:47

If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

南冥有猫 2024-09-10 20:33:47

With gawk, you can use the match function to capture parenthesized groups.

gawk 'match($0, pattern, ary) {print ary[1]}' 

Example:

echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}' 

Outputs cd.

Note that this requires gawk specifically: the three-argument form of match() is a gawk extension.

For a portable alternative, you can achieve similar results with match() and substr().

Example:

echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

Outputs cd.
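
When the part you want sits in the middle of a longer pattern, a negated bracket expression is not always enough. One possible approach, sketched here with POSIX awk only, is to match the whole pattern first and then run a second match() on the matched text to isolate the "group". The helper name extract_ms and the sample input are made up for illustration.

```shell
# Hypothetical helper: print the number inside "time=<digits>ms" (POSIX awk only).
extract_ms() {
    awk '
        match($0, /time=[0-9]+ms/) {
            whole = substr($0, RSTART, RLENGTH)   # the full match, e.g. "time=123ms"
            if (match(whole, /[0-9]+/))           # second pass isolates the digits
                print substr(whole, RSTART, RLENGTH)
        }'
}

# Illustrative input: two lines contain a timing, one does not.
result=$(printf 'GET / time=123ms\nno timing here\ntime=7ms ok\n' | extract_ms)
echo "$result"
```

The second match() overwrites RSTART and RLENGTH, which is why the full match is saved in a variable first.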

南笙 2024-09-10 20:33:47

That was a stroll down memory lane...

I replaced awk with Perl a long time ago.

Apparently the AWK regular expression engine does not capture its groups.

You might consider using something like:

perl -n -e'/test(\d+)/ && print $1'

The -n flag causes perl to loop over every line, like awk does.

靖瑶 2024-09-10 20:33:47

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

Definition

Add this to your .bash_profile etc.

function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }

Usage

Print the regex match for each line in a file

$ cat filename | regex '.*'

Print the first capture group for each line in a file

$ cat filename | regex '(.*)' 1
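
Splicing $1 into the awk program text means the pattern is parsed as awk source, so a slash or a quote inside it can break the command. A possible alternative, sketched here on the assumption that only the whole match (index 0) is needed, passes the pattern with -v and uses it as a dynamic regex. The name regex0 is hypothetical, and this version needs only POSIX awk.

```shell
# Hypothetical variant of the function above: prints the whole match only.
regex0() {
    # -v hands the pattern to awk as a string, which match() uses as a dynamic regex.
    # awk processes backslash escapes in -v values, so a literal dot is written "\\.".
    awk -v re="$1" 'match($0, re) { print substr($0, RSTART, RLENGTH) }'
}

# Illustrative input: only the first line matches.
result=$(printf 'took 123ms\nnothing\n' | regex0 '[0-9]+ms')
echo "$result"
```

Capture groups are still not available this way; the pattern just no longer has to be valid awk source.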

南街女流氓 2024-09-10 20:33:47

You can use GNU awk:

$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]

$ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/
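
As far as I know, awk's extended regular expressions have no lazy quantifiers, so the .*? above is not non-greedy; the match stops where it does because of the literal \$. A negated bracket expression is the usual way to stop at the first delimiter. A minimal sketch with plain awk, feeding the same two lines via printf instead of the hta file:

```shell
# Print everything from "http" up to (not including) the first "$" on each matching line.
result=$(printf '%s\n' \
    'RewriteCond %{HTTP_HOST} !^www\.mysite\.net$' \
    'RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]' |
    awk 'match($0, /http[^$]*/) { print substr($0, RSTART, RLENGTH) }')
echo "$result"
```

Inside a bracket expression the $ is an ordinary character, so [^$]* means "any run of characters other than a dollar sign".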

绅士风度i 2024-09-10 20:33:47

NOTE: gensub is a gawk extension, so its use is not POSIX compliant.

You can also simulate capturing without the array argument of match(). It's not intuitive, though:

step 1. Use gensub to surround each match with some character that doesn't appear in your string.
step 2. Use split against that character.
step 3. Every other element in the split array is one of your captures.

$ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad
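
For comparison, here is a sketch that collects every match using only POSIX match(), RSTART, RLENGTH and substr(), with no gensub. The helper name all_matches is made up for illustration; the pattern /a./ and the input are the ones from the example above.

```shell
# Hypothetical helper: join every match of /a./ on a line with "|" (POSIX awk only).
all_matches() {
    awk '{
        rest = $0; out = ""
        while (match(rest, /a./)) {
            # append the current match, with a separator after the first one
            out = out (out == "" ? "" : "|") substr(rest, RSTART, RLENGTH)
            # continue searching in the text after this match
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

result=$(echo 'ab cb ad' | all_matches)
echo "$result"
```

Because each match is consumed before the next search, no separator character has to be reserved.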

意中人 2024-09-10 20:33:47

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

function regex
{
perl -n -e "/$1/ && printf \"%s\n\", "'$1'
}

I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

'([0-9]*)ms'

笑脸一如从前 2024-09-10 20:33:47

I think gawk's match()-to-array only covers the first instance of the capture group.

If there are multiple things you'd like to capture, and you want to perform complex operations on them, perhaps:

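gawk 'BEGIN { S = SUBSEP }
      {
          # separate the three capture groups with SUBSEP, then split on it
          # ($0 stands in for the original "str", which was never defined)
          nx = split(gensub(/(..(..)..(..))/,
                            "\\1" S "\\2" S "\\3", "g", $0),
                     arr, S)
          # nx is a count, so loop by index; replace print with your own operation
          for (x = 1; x <= nx; x++) print arr[x]
      }'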

This way you aren't constrained by either gensub(), which limits the complexity of your modifications, or by match().

By pure trial and error, one caveat I've noted about gawk in Unicode mode, for a valid Unicode string 뀇꿬 with the 6 octal codes listed below:

Scenario 1: matching individual bytes is fine, but gawk reports the multi-byte (character-level) RSTART of 1 instead of the byte-level answer of 3. It also gives no information on whether \207 is the first or the second continuation byte, since RLENGTH is always 1 here.

$ gawk 'BEGIN{ print match("\353\200\207\352\277\254", "\207") }'
1

Scenario 2: match() also works with Unicode-invalid patterns like this:

$ gawk 'BEGIN{ match("\353\200\207\352\277\254", "\207\352");
               print RSTART, RLENGTH }'
1 2

Scenario 3: you can check for the existence of a pattern in a Unicode-illegal string (\300, i.e. \xC0, is UTF-8-invalid for all possible byte pairings):

$ gawk 'BEGIN{ print ("\300\353\200\207\352\277\254" ~ /\200/) }'
1

Scenarios 4/5/6: the warning shows up for either (a) match() with a Unicode-invalid string, or (b) index() when either argument is Unicode-invalid or incomplete.

$ gawk 'BEGIN{ match("\300\353\200\207\352\277\254", "\207\352"); print RSTART, RLENGTH }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2 2

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\352") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\200") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0
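
If byte-level positions are what you need, one option is to run awk in the C locale, where each byte should be treated as one character. A small sketch using the same six bytes (the variable names are illustrative, and plain awk is assumed to behave like gawk here):

```shell
# LC_ALL=C asks awk to work on bytes rather than multi-byte characters.
# The octal escapes are the six bytes of the two-character string from the scenarios.
byte_pos=$(LC_ALL=C awk 'BEGIN { print index("\353\200\207\352\277\254", "\207") }')
byte_len=$(LC_ALL=C awk 'BEGIN { print length("\353\200\207\352\277\254") }')
echo "$byte_pos $byte_len"
```

The trade-off is that character classes and length() then count bytes everywhere, so this suits byte-oriented checks rather than general text processing.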