Awk/etc.: Extract Matches from File
I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything, whereas I simply want to print the match in parentheses, ([^>]+). Either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get all of the little details of HTML parsing right, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains any markup (such as a list that contains a list). Then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
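As a rough illustration of the special case described above, here is a minimal sketch of such an awk program. It is not code from this thread; it assumes every list item sits on one line and that the link text contains no nested tags. The function name extract_link_text and the sample lines are made up for the example.

```shell
# Sketch only: assumes each <li ...><a ...>text</a> is on a single line
# and the link text has no nested markup.
extract_link_text() {
  awk '
    # Step 1: keep only lines holding a list item that wraps an anchor.
    /<li[^>]*><a[^>]*>[^<]*<\/a>/ {
      line = $0
      # Step 2: drop everything up to and including the opening <a ...> tag.
      sub(/^.*<li[^>]*><a[^>]*>/, "", line)
      # Step 3: drop the closing </a> and whatever follows it.
      sub(/<\/a>.*$/, "", line)
      print line
    }
  ' "$@"
}

# Hypothetical sample data standing in for cities.html.
printf '%s\n' \
  '<ul>' \
  '<li class="city"><a href="/paris">Paris</a></li>' \
  '<li class="city"><a href="/new-york">New York</a></li>' \
  '</ul>' | extract_link_text
```

The function also accepts file names, so extract_link_text cities.html would work the same way. Anything that breaks the one-line assumption is silently skipped, which is the reason a real HTML parser is the safer choice.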
Worked pretty well for me.
Going by your script, if it already finds the lines you want (meaning the <li> and <a> tags are on the same line), there are two ways to do it: the first works with every awk, the second only with GNU awk.
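The commands from this answer did not survive in the text, so the following is only a sketch of what a portable variant and a GNU-only variant might look like, not the poster's code. It relies on two facts: POSIX awk's match() sets RSTART and RLENGTH but has no capture groups, while GNU awk's match() accepts an array as a third argument and fills it with the groups. The function names and the sample line are hypothetical.

```shell
# Portable: cut out the matched span with substr(), then trim the tags.
extract_posix() {
  awk '
    match($0, /<li[^>]*><a[^>]*>[^<]*<\/a>/) {
      s = substr($0, RSTART, RLENGTH)   # the whole matched span
      sub(/^<li[^>]*><a[^>]*>/, "", s)  # strip the leading tags
      sub(/<\/a>$/, "", s)              # strip the trailing </a>
      print s
    }
  ' "$@"
}

# GNU awk only: the three-argument match() stores group 1 in m[1].
extract_gawk() {
  gawk 'match($0, /<li[^>]*><a[^>]*>([^<]*)<\/a>/, m) { print m[1] }' "$@"
}

# Hypothetical sample line.
sample='<li class="city"><a href="/paris">Paris</a></li>'
printf '%s\n' "$sample" | extract_posix
if command -v gawk >/dev/null 2>&1; then
  printf '%s\n' "$sample" | extract_gawk
fi
```

Both variants still assume the tags share one line, as the answer states.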
There are several issues that I see:
- The pattern looks for an anchor, from '<a>' to '</a>', not the end of the list item.
- It matches anything that is not '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
- In Awk, '$1' denotes the first field, where fields are separated by the field separator characters, which default to white space.
- nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
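To make the point about '$1' concrete, here is a small sketch on a made-up input line. It shows that fields are positional pieces of the record and have nothing to do with the parenthesized group in the pattern.

```shell
# Hypothetical sample line.
line='<li class="city"><a href="/new-york">New York</a></li>'

# Default splitting is on white space, so the first field is only the
# first blank-delimited chunk of the line.
printf '%s\n' "$line" | awk '{ print $1 }'          # prints: <li

# Using < and > as field separators is still positional: the link text
# lands in field 5 here, and that index shifts as soon as the markup changes.
printf '%s\n' "$line" | awk -F'[<>]' '{ print $5 }'  # prints: New York
```

The second command works for this exact line only, which matches the question's note that positional matching is a no-go for poorly structured data.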
Don't really know awk, how about Perl instead?
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.