Awk/etc.: Extract matches from a file

Posted 2024-07-24 08:19:12

    I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:

    awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
    

    The problem is that this prints everything, whereas I simply want to print the match in parentheses, ([^>]+). Either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)

    Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.

    Comments (5)

    何以心动 2024-07-31 08:19:12

    If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.

    If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains any markup (such as a nested list). Then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.

    But in general, awk is the wrong tool for this job.
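The special case described above can be sketched in portable awk. This is only an illustration under the stated assumptions (one list item per line, no nested markup inside the anchor); the function name `extract_link_text` and the sample data are hypothetical, since the real cities.html is not shown. It uses `match()`, which sets RSTART and RLENGTH, and then trims the tags with `sub()`:

```shell
#!/bin/sh
# Hypothetical sample input; the real cities.html is not shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ul>
<li class="city"><a href="/paris">Paris</a></li>
<li class="city"><a href="/berlin">Berlin</a></li>
</ul>
EOF

# extract_link_text FILE: print the anchor text of each <li><a>...</a> item.
# Assumes one list item per line and no nested tags inside the anchor.
extract_link_text() {
    awk '
    # match() returns the match position (0 if none) and sets RSTART/RLENGTH
    match($0, /<li[^>]*><a[^>]*>[^<]*<\/a>/) {
        item = substr($0, RSTART, RLENGTH)    # only the matched part of the line
        sub(/^<li[^>]*><a[^>]*>/, "", item)   # strip the opening tags
        sub(/<\/a>$/, "", item)               # strip the closing tag
        print item
    }' "$1"
}

extract_link_text "$sample"
rm -f "$sample"
```

On the sample above this prints Paris and Berlin, one per line. Lines without a matching list item are skipped, and only the first item on a line is reported.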

    动次打次papapa 2024-07-31 08:19:12
    gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
    

    Worked pretty well for me.
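To see why that one-liner works, here is a small sketch on made-up data (the sample file and the function name `list_items` are assumptions for illustration). With `RS='</li>'` gawk ends a record at every closing tag, `RT` holds the text that actually terminated the record (so it is empty for the trailing leftover), and `-F'<li>'` makes `$NF` the text after the last opening tag:

```shell
#!/bin/sh
# Hypothetical sample input with plain <li> tags (no attributes).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ul>
<li>Paris</li>
<li>Berlin</li>
</ul>
EOF

# list_items FILE: print the body of every <li>...</li> element.
# RS as a multi-character string and the RT variable are gawk extensions.
list_items() {
    gawk -F'<li>' -v RS='</li>' 'RT { print $NF }' "$1"
}

# Only run the demo when gawk is actually installed.
if command -v gawk >/dev/null 2>&1; then
    list_items "$sample"
fi
rm -f "$sample"
```

Because records are split on the closing tag rather than on newlines, items that span several lines or share a line are handled too. If the opening tags carry attributes, a field separator such as `<li[^>]*>` would probably be needed instead; that variation is an assumption and not part of the original answer.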

    渔村楼浪 2024-07-31 08:19:12

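    Going by your script, if the <li> and <a> tags are on the same line, you can get what you want with:

    $ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'

    or

    $ cat test.html | gawk '/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/&&($0=gensub(/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/,"\\1", 1))'

    The first one works in any awk; the second one needs GNU awk, because gensub() is a gawk extension. Note that awk regular expressions have no non-greedy quantifier such as .*?, so the capture group uses [^<]* to stop at the next tag.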

    信愁 2024-07-31 08:19:12

    There are several issues that I see:

    • The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
    • Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
    • You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
    • Awk does not do multi-line searches.
    • In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
    • Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches.

    It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
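For the question as asked (print only the parenthesized group), GNU awk does offer something classic awk lacks: a third, array argument to `match()`. The following is a minimal sketch, assuming gawk is installed, each item sits on one line, and only the first match per line matters. The function name `first_group` and the sample data are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample input; the real cities.html is not shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ul>
<li class="city"><a href="/paris">Paris</a></li>
<li class="city"><a href="/berlin">Berlin</a></li>
</ul>
EOF

# first_group FILE: print the text captured by the parenthesized group.
first_group() {
    gawk '
    # The array argument to match() is a gawk extension: m[1] receives
    # the text matched by the first parenthesized subexpression.
    match($0, /<li[^>]*><a[^>]*>([^<]+)<\/a>/, m) { print m[1] }
    ' "$1"
}

# Only run the demo when gawk is actually installed.
if command -v gawk >/dev/null 2>&1; then
    first_group "$sample"
fi
rm -f "$sample"
```

The group here is `[^<]+` rather than the question's `[^>]+`, following the remark above that excluding '<' is the more usual choice for anchor text.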

    救星 2024-07-31 08:19:12

    Don't really know awk, how about Perl instead?

    tr -d '\012' < the.html | perl \
    -e '$text = <>;' \
    -e 'while ( $text =~ /<li>(.*?)<\/li>(.*)/ )' \
    -e '{ print "$1\n"; $text = $2 }'

    1) remove newlines from the file (tr reads standard input, hence the redirect) and pipe the result through perl

    2) initialize a variable with the complete text and loop for as long as it still contains a list item

    3) do a "non-greedy" match for the text bounded by list-item tags, print it, and keep the remainder for the next pass

    Make sense? (Warning: I did not try this code myself, need to go home soon...)

    P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.
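As a rough illustration of the "perl -n is awk mode" remark: `-n` wraps the code in an implicit loop over the input lines, so a pattern with a capture group can be applied per line much as awk would. This sketch assumes one list item per line; the function name `link_text_perl` and the sample data are hypothetical:

```shell
#!/bin/sh
# Hypothetical sample input; the real file is not shown in the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ul>
<li class="city"><a href="/paris">Paris</a></li>
<li class="city"><a href="/berlin">Berlin</a></li>
</ul>
EOF

# link_text_perl FILE: print the first anchor text found on each line.
link_text_perl() {
    # -n loops over input lines without printing them automatically;
    # after a successful match, $1 holds the first capture group.
    perl -ne 'print "$1\n" if /<li[^>]*><a[^>]*>([^<]+)<\/a>/' "$1"
}

link_text_perl "$sample"
rm -f "$sample"
```

Unlike the tr pipeline above, this keeps the line structure, so it does not find items that are split across lines.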
