Awk/etc.: Extract Matches from File

Posted 2024-07-24 08:19:12

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:

awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html

The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)

Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.

何以心动 2024-07-31 08:19:12

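If you want to do this in the general case, where the list items can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job is an HTML parser, which you can trust to get all the small details of HTML parsing right, including HTML variants and malformed HTML.

If you are doing this for a special case where you control the HTML formatting, you may be able to make awk work for you. For example, suppose you can guarantee that each list element never occupies more than one line, always ends with </li> on the same line, and never contains any markup (such as a nested list). Then awk can do it, but you need to write a whole awk program that first finds the lines containing list elements and then uses other awk commands to pick out the substring you are interested in.

But in general, awk is the wrong tool for this job.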
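
For the restricted case described above, a minimal sketch in portable awk might look like the following. It assumes every item sits on a single line in the form `<li ...><a ...>text</a>`; the helper name `extract_links` and the sample markup are invented for illustration and do not come from the question.

```shell
# Hypothetical helper: print the text of every <li><a>...</a> item,
# assuming each item fits on a single line.
extract_links() {
    awk '
    # match() sets RSTART and RLENGTH to the span of the opening tags.
    match($0, /<li[^>]*><a[^>]*>/) {
        rest = substr($0, RSTART + RLENGTH)   # text after the opening <a ...>
        stop = index(rest, "</a>")            # position of the closing tag
        if (stop > 0)
            print substr(rest, 1, stop - 1)
    }' "$@"
}

# Sample input (invented for this example).
printf '%s\n' \
    '<ul>' \
    '  <li class="c"><a href="/paris">Paris</a></li>' \
    '  <li class="c"><a href="/oslo">Oslo</a></li>' \
    '</ul>' | extract_links
```

With the sample input this prints Paris and Oslo on separate lines. Because it uses only `match()`, `substr()`, and `index()`, it should not depend on GNU extensions.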

动次打次papapa 2024-07-31 08:19:12

gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file

Worked pretty well for me.

渔村楼浪 2024-07-31 08:19:12

Going by your script, if it gets you what you want (which means the <li> and <a> tags are on one line), you can use:

$ awk 'sub(/.*<li[^>]*><a[^>]*>/, "") && sub(/<\/a>.*/, "")' test.html

or

$ gawk '/<li[^>]*><a[^>]*>[^<]*<\/a>/ { print gensub(/.*<li[^>]*><a[^>]*>([^<]*)<\/a>.*/, "\\1", 1) }' test.html

The first one works in every awk; the second one needs GNU awk.
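
GNU awk also has a more direct way to read a parenthesized group: its `match()` accepts an optional third argument, an array that receives the text of each group. The sketch below assumes gawk is installed and that each link sits on one line; the helper name `link_text` is hypothetical.

```shell
# Hypothetical helper; requires GNU awk, because the three-argument
# form of match() is a gawk extension.
link_text() {
    # m[1] receives the text matched by the first parenthesized group.
    gawk 'match($0, /<li[^>]*><a[^>]*>([^<]*)<\/a>/, m) { print m[1] }' "$@"
}
```

Calling `link_text cities.html` would then print one link text per matching line, which is close to what the question was trying to do with `([^>]+)`.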

信愁 2024-07-31 08:19:12

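I see several problems:

- The pattern has a trailing 'm', which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
- Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor, '<a>' to '</a>', not for the end of the list item.
- You search for anything that is not '>' as the body of the anchor; that is not automatically wrong, but it is probably more usual to search for anything that is not '<', or anything that is neither.
- Awk does not do multi-line searches.
- In Awk, '$1' denotes the first field, where fields are separated by the field separator, which defaults to white space.
- Classic nawk (as documented in the 1991 'sed & awk' book) has no mechanism in place for pulling sub-fields out of a match.

It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.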
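
The first point probably also explains why the original command prints every line. As far as the awk grammar goes, `/re/m` is not a flagged regex. It is the match result (0 or 1) concatenated with an unset variable named `m`, and the resulting non-empty string counts as true. A small sketch with invented sample data (`sample` is a hypothetical helper):

```shell
# Sample input: only the first line contains "foo".
sample() { printf 'foo\nbar\n'; }

# /foo/m parses as the match result (0 or 1) concatenated with the unset
# variable m. The result is a non-empty string, which awk treats as true,
# so both lines are printed.
sample | awk '/foo/m'

# Without the trailing m, the pattern filters as intended: only foo.
sample | awk '/foo/'
```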

救星 2024-07-31 08:19:12

Don't really know awk, how about Perl instead?

tr -d '\012' < the.html | perl \
-e '$text = <>;' \
-e 'while ($text =~ /<li>(.*?)<\/li>(.*)/)' \
-e '{ print "$1\n"; $text = $2 }'

1) remove newlines from the file and pipe the result through perl

2) initialize a variable with the complete text and loop for as long as a list item still matches

3) do a "non-greedy" match for the text bounded by list-item tags, print the captured target, and keep the remainder for the next pass

Make sense? (Warning: I did not try this code myself, need to go home soon...)

P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.
