Why doesn't the regular expression work in "greedy" mode?
I do not understand this behavior. I have the following example, where I need to capture an HTML comment.
var str = '.. <!--My -- comment test--> ';
var regex1 = /<!--[.]*-->/g;
var regex2 = /<!--.*-->/g;
alert(str.match(regex1)); // null
alert(str.match(regex2)); // <!--My -- comment test-->
The second regex, regex2, works fine and outputs exactly what's needed. The first shows null, and I don't understand the difference. The regular expressions <!--[.]*--> and <!--.*--> mean the same thing: "after <!--, take ANY character except newline, from 0 to as many as possible, and finish with -->". But the second works and the first does not. Why?
UPD. I've read the comments and have an update.
var str3 = '.. <!--Mycommenttest--> ';
var str4 = '.. <!--My comment test--> ';
var regex3 = /<!--[\w]*-->/g;
var regex4 = /<!--[\s\S]*-->/g;
alert(str3.match(regex3)); // <!--Mycommenttest-->
alert(str4.match(regex4)); // <!--My comment test-->
So it's possible to use limited matching variables to match anything. Which is the right way to write RegExps, then: with [] or without them? I can't see the difference; both give the right output.
Comments (4)
The dot (.) does not mean "anything" inside a character class. Why would you need a character class to match anything?
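A minimal sketch of that point, using only standard JavaScript RegExp behavior (the variable names and sample inputs are made up for illustration): inside brackets the dot is just a literal period, so [.] behaves like \., while a shorthand such as \w keeps its meaning.

```javascript
// Inside a character class, "." is a literal period, not "any character".
const literalDot = /^[.]$/;   // class containing only "."
const escapedDot = /^\.$/;    // escaped dot, equivalent to [.]
const anyChar = /^.$/;        // unbracketed dot: any character except a line terminator
const wordInClass = /^[\w]$/; // shorthand keeps its meaning inside a class

console.log(literalDot.test('.'));  // true
console.log(literalDot.test('a'));  // false
console.log(escapedDot.test('.'));  // true
console.log(anyChar.test('a'));     // true
console.log(wordInClass.test('a')); // true
```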
Character class shorthands like \w, \d and \s mean exactly the same inside character classes as outside, but metacharacters like . typically lose their special meanings inside character classes. That's why /<!--[.]*-->/ didn't work as you expected: [.] matches a literal period.
But /<!--.*-->/ doesn't really work either, since . doesn't match newlines. In most regex flavors you would use single-line mode to let the dot match all characters including newlines, like this: /<!--.*-->/s or this: (?s)<!--.*-->. But JavaScript did not support that feature until ES2018 added the s (dotAll) flag, so most people use [\s\S] instead, meaning "any whitespace character or any character that's not whitespace"; in other words, any character.
But that's not right either, since (as Jason pointed out in his comment) it will greedily match everything from the first <!-- to the last -->, which could encompass several individual comments and all the non-comment material between them. To make it truly correct is probably not worth the effort. When using regexes to match HTML, you have to make many simplifying assumptions anyway; if you can't assume a certain level of well-formedness, you might as well give up. In this case, it should suffice to make the quantifier non-greedy, that is, to follow * with ?.
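As a sketch of what the non-greedy version might look like (the exact pattern and the sample string are assumptions for illustration, not quoted from the answer), adding ? after * makes the quantifier stop at the first --> it can reach instead of the last one:

```javascript
// Hypothetical sample: two comments with ordinary markup between them,
// and a line break inside the second comment.
const html = '<!--first--> <p>text</p> <!--second\nline-->';

const greedy = /<!--[\s\S]*-->/g; // runs from the first "<!--" to the LAST "-->"
const lazy = /<!--[\s\S]*?-->/g;  // stops at the FIRST "-->" after each "<!--"

const greedyMatches = html.match(greedy);
const lazyMatches = html.match(lazy);

console.log(greedyMatches.length); // 1: a single match spanning the whole string
console.log(lazyMatches.length);   // 2: one match per comment
console.log(lazyMatches);
```

In engines that support the ES2018 s flag, /<!--.*?-->/gs should give the same two matches.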
This is not correct.
The brackets [] indicate a character class, where any character in the class may be matched. [.] is the character class which contains only the '.' character. Contrast this with ., which is a pre-defined character class taken to mean "any character" (except for line terminators).
So what you're matching with <!--[.]*--> is either an empty comment or a comment consisting wholly of '.' characters. And what you're matching with <!--.*--> is either an empty comment or a comment filled with any characters except line breaks.
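To make the contrast concrete, here is a small sketch that runs both patterns against a few sample comments (the sample strings and names are invented for illustration):

```javascript
const dotsOnly = /<!--[.]*-->/;    // class containing only a literal "."
const anyButNewline = /<!--.*-->/; // "." outside a class

// Hypothetical sample comments.
const samples = {
  empty: '<!---->',
  dots: '<!--...-->',
  text: '<!--My -- comment test-->',
  multiline: '<!--line one\nline two-->',
};

const results = {};
for (const [name, s] of Object.entries(samples)) {
  results[name] = {
    dotsOnly: dotsOnly.test(s),
    anyButNewline: anyButNewline.test(s),
  };
}
console.log(results);
// empty and dots: both regexes match
// text:           only anyButNewline matches
// multiline:      neither matches, since "." stops at the line break
```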
The first doesn't work because it doesn't mean the same thing. The first means to match the period character. The period character isn't a generic match when put inside a [] set (and if you think about it, this makes sense: why would you want to match anything inside a set of limited matching variables?).