Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
Comments (7)
When I need to match several characters, including line breaks, I do:
Note I'm using a non-greedy pattern
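A minimal sketch of the non-greedy idea, written in Python as an assumption (the question names no language). `re.DOTALL` lets the dot match newlines, and `+?` stops at the first footer rather than the last. The sample `page` string is made up for illustration.

```python
import re

# Hypothetical page text; HEADER TEXT / FOOTER TEXT are the delimiters from the question.
page = "HEADER TEXT\nline one\nline two\nFOOTER TEXT\ntrailing\nFOOTER TEXT"

# re.DOTALL lets "." match newlines; "+?" is non-greedy, so the match
# stops at the first FOOTER TEXT instead of the last one.
lazy = re.search(r"HEADER TEXT(.+?)FOOTER TEXT", page, re.DOTALL)

# The greedy form runs on to the last FOOTER TEXT in the string.
greedy = re.search(r"HEADER TEXT(.+)FOOTER TEXT", page, re.DOTALL)

print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

With a page that contains the footer string more than once, the greedy form swallows everything up to the final occurrence, which is usually not what you want.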
You could do it with Perl:
To print only the text between the delimiters, use
The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
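For readers who prefer Python, here is a rough analogue of what /s and /g do, not the Perl from this answer. `re.DOTALL` plays the role of /s and `re.findall` repeats the match like /g. The function name `sections_between` and the sample string are hypothetical.

```python
import re

def sections_between(text, start="HEADER TEXT", end="FOOTER TEXT"):
    """Return every non-overlapping chunk found between start and end."""
    # re.escape keeps any regex metacharacters in the delimiters literal.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    # re.DOTALL plays the role of Perl's /s; findall repeats the match like /g.
    return re.findall(pattern, text, flags=re.DOTALL)

sample = "HEADER TEXTa\nbFOOTER TEXT junk HEADER TEXTc\nFOOTER TEXT"
print(sections_between(sample))
```

The same caveat applies: this is fine for a quick scan of known pages, not a substitute for an HTML parser.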
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
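The line-range behaviour described above can be sketched in Python as a small state machine. This is an illustration rather than the sed script itself: `lines_in_range` is a hypothetical name, and it uses plain substring tests where sed would use regular expressions.

```python
def lines_in_range(lines, start="HEADER TEXT", end="FOOTER TEXT"):
    """Yield lines from one containing start through the next one containing end."""
    inside = False
    for line in lines:
        if not inside:
            if start in line:
                inside = True
                yield line
            # The end marker is only looked for on later lines, so a header and
            # footer sharing one line will not close the range there.
        else:
            yield line
            if end in line:
                inside = False  # keep scanning: a later header reopens the range

doc = ["a", "HEADER TEXT", "b", "FOOTER TEXT", "c", "HEADER TEXT x", "d", "FOOTER TEXT", "e"]
print(list(lines_in_range(doc)))
```

Note that the delimiter lines themselves are included in the output, and that the same-line limitation mentioned above shows up here too.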
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
The man page of grep says:
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s).
(?s). will match ANY character. And yes, (?s).+ will match the whole text.
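The same inline modifier can be tried outside BBEdit. As a small sketch, Python's `re` module also accepts `(?s)` at the start of a pattern; BBEdit's engine may differ in other details, and the sample text is made up.

```python
import re

text = "first line\nsecond line"

# Without the flag, "." stops at the line break.
plain = re.match(r".+", text).group(0)

# With the inline (?s) modifier at the start of the pattern, "." matches "\n" too.
dotall = re.match(r"(?s).+", text).group(0)

print(repr(plain))
print(repr(dotall))
```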
As pointed out elsewhere, grep will work for single-line stuff.
For multiple lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
might work ...
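One caveat worth illustrating: `\s` on its own only matches whitespace, so it has to be paired with something that matches everything else. A common way to do that is the character class `[\s\S]`. The sketch below uses Python as an assumption. Note that in Python `re.MULTILINE` only changes `^` and `$`, whereas Ruby's `Regexp::MULTILINE` is the option that makes the dot match newlines.

```python
import re

page = "HEADER TEXT\nbody line 1\nbody line 2\nFOOTER TEXT"

# In Python, re.MULTILINE does not affect ".", so this finds nothing
# when the header and footer sit on different lines.
no_match = re.search(r"HEADER TEXT(.+)FOOTER TEXT", page, re.MULTILINE)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all,
# line breaks included, with no flags needed.
match = re.search(r"HEADER TEXT([\s\S]+?)FOOTER TEXT", page)

print(no_match)
print(repr(match.group(1)))
```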
Here's one way to do it with gawk, if you have it:
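A comparable approach without regular expressions is to treat the footer as a record separator and keep whatever follows the header in each record. This is sketched in Python rather than gawk and is not necessarily how the gawk answer worked; `between_delimiters` and the sample string are hypothetical.

```python
def between_delimiters(text, start="HEADER TEXT", end="FOOTER TEXT"):
    """Collect the text after start in every record that is terminated by end."""
    chunks = []
    # Treat the footer as a record separator; the piece after the last footer
    # is not terminated by one, so it is dropped.
    for record in text.split(end)[:-1]:
        # Keep only what follows the first header within this record, if any.
        _, found, body = record.partition(start)
        if found:
            chunks.append(body)
    return chunks

sample = "xHEADER TEXTa\nbFOOTER TEXTyHEADER TEXTcFOOTER TEXTz"
print(between_delimiters(sample))
```

Because it works on the whole string, line breaks between the delimiters need no special handling.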