Is there a truly universal wildcard in grep?

Posted 2024-08-14 18:34:28, 196 words, 13 views, 0 comments

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.

All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!

Comments (7)

·深蓝 2024-08-21 18:34:29

When I need to match several characters, including line breaks, I do:

[\s\S]*?

Note I'm using a non-greedy pattern
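
As a rough illustration in Python (the sample text and variable names are made up for this example), [\s\S] matches across line breaks where a plain dot does not, and the non-greedy *? stops at the first footer rather than the last:

```python
import re

# Hypothetical sample page; the markers come from the question.
page = "HEADER TEXT\nline one\nline two\nFOOTER TEXT\ntrailer\nFOOTER TEXT"

# A plain dot stops at line breaks, so nothing matches across lines.
dot_match = re.search(r"HEADER TEXT(.+)FOOTER TEXT", page)

# [\s\S] matches any character; *? stops at the first footer.
lazy = re.search(r"HEADER TEXT([\s\S]*?)FOOTER TEXT", page)

# The greedy form runs on to the last footer instead.
greedy = re.search(r"HEADER TEXT([\s\S]*)FOOTER TEXT", page)

print(dot_match)
print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```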

够钟 2024-08-21 18:34:29

You could do it with Perl:

$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html

To print only the text between the delimiters, use

$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html

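The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible. Note that -000 reads the input in paragraph mode, so a match cannot span a blank line; -0777 reads the whole file as a single record instead.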
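
For comparison, here is a minimal Python sketch of the same idea, assuming the text is already in a string. re.DOTALL stands in for Perl's /s and findall for /g; the sample input is invented for illustration.

```python
import re

# Hypothetical input with two header/footer pairs.
text = "HEADER TEXT a\nb FOOTER TEXT junk HEADER TEXT c\nd FOOTER TEXT"

# re.DOTALL plays the role of Perl's /s; findall returns every match, like /g.
pattern = re.compile(r"HEADER TEXT(.+?)FOOTER TEXT", re.DOTALL)
blocks = pattern.findall(text)

for block in blocks:
    print(repr(block))
```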

The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:

$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
                          print $1 while m!<head>(.+?)</head>!sg'

Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
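
To give an idea of what "a real parser" might look like, here is a small sketch using Python's standard html.parser module. The TagText class and the sample markup are hypothetical, written only for this example; a production scraper would likely need more care (nested tags, malformed markup).

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect the text found inside one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.chunks = []    # text pieces seen while inside the target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


html = "<html><head>\n<title>Demo\npage</title></head><body><p>Body</p></body></html>"
parser = TagText("head")
parser.feed(html)
parser.close()
head_text = "".join(parser.chunks).strip()
print(repr(head_text))
```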

兔小萌 2024-08-21 18:34:29

By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.

One possible way to do what you want is with sed:

sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"

This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
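
The line-range behaviour can be sketched in Python roughly as follows. The sed_range function is a hypothetical stand-in that uses substring tests rather than regular expressions; the second sample shows why a header and footer on the same line cause trouble, since the footer is only looked for on later lines.

```python
def sed_range(lines, start, end):
    """Mimic sed -n '/start/,/end/p' using substring tests instead of regexes."""
    selected = []
    inside = False
    for line in lines:
        if inside:
            selected.append(line)
            if end in line:        # the range closes on the footer line
                inside = False
        elif start in line:
            selected.append(line)  # the footer is not looked for on this line
            inside = True
    return selected


page = ["nav", "x HEADER TEXT", "body 1", "body 2", "FOOTER TEXT y", "legal"]
one_line = ["HEADER TEXT all FOOTER TEXT", "extra", "FOOTER TEXT", "tail"]

print(sed_range(page, "HEADER TEXT", "FOOTER TEXT"))
print(sed_range(one_line, "HEADER TEXT", "FOOTER TEXT"))
```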

To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.

能怎样 2024-08-21 18:34:29

The man page of grep says:

grep, egrep, fgrep, rgrep - print lines matching a pattern

grep is not made for matching more than a single line. You should try to solve this task with perl or awk.

南城追梦 2024-08-21 18:34:29

As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s):

(?s).

will match ANY character. And yes,
(?s).+
will match the whole text.
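
Python's re module accepts the same inline (?s) modifier, so the effect can be checked outside BBEdit. This is only a quick sketch with made-up sample text, not a statement about BBEdit's engine:

```python
import re

text = "first line\nsecond line\n"

# Without the flag, .+ stops at the first line break.
plain = re.search(r".+", text).group(0)

# With the inline (?s) flag, the dot also matches line breaks.
whole = re.search(r"(?s).+", text).group(0)

print(repr(plain))
print(repr(whole))
```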

滴情不沾 2024-08-21 18:34:29

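As pointed out elsewhere, grep works on single lines.

For multiple lines (in Ruby with Regexp::MULTILINE, or in Python, awk, sed, whatever), "\s" also matches line breaks, so

HEADER TEXT(.*\s*)FOOTER TEXT

might work, but only when the text between the markers is a single line followed by whitespace, unless the dot itself is allowed to match line breaks (as it is under Ruby's Regexp::MULTILINE).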
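
A quick check in Python (default flags, invented sample strings) suggests where this pattern reaches its limit: without a dot-matches-newline option, .* covers one line and \s* only the whitespace after it, so a second line of text between the markers prevents a match.

```python
import re

pattern = re.compile(r"HEADER TEXT(.*\s*)FOOTER TEXT")

one_line = "HEADER TEXT only one line\nFOOTER TEXT"
two_lines = "HEADER TEXT first line\nsecond line\nFOOTER TEXT"

short = pattern.search(one_line)    # one line of text, then a line break
longer = pattern.search(two_lines)  # two lines of text between the markers

print(repr(short.group(1)))
print(longer)
```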

氛圍 2024-08-21 18:34:29

Here's one way to do it with gawk, if you have it:

gawk -v RS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file
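
The record-splitting idea carries over to Python fairly directly. The between function below is a hypothetical analogue, using plain string operations rather than a regex engine: splitting on the footer plays the role of RS, and keeping what follows the last header mirrors the greedy .*HEADER substitution.

```python
def between(text, header="HEADER", footer="FOOTER"):
    """Rough Python analogue of the gawk one-liner (plain string operations)."""
    results = []
    # Splitting on the footer plays the role of RS="FOOTER".
    for record in text.split(footer):
        if header in record:
            # Keep what follows the last header, like gsub(/.*HEADER/, "").
            results.append(record.rsplit(header, 1)[1])
    return results


page = "nav\nHEADER\nline 1\nline 2\nFOOTER\nlegal\n"
print(between(page))
```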