Why doesn't this regular expression work in "greedy" mode?

Posted 2025-01-02 06:01:18 · 977 characters · 2 views · 0 comments

I don't understand this behavior. I have the following example, where I need to capture an HTML comment.

var str = '.. <!--My -- comment test--> ';

var regex1 = /<!--[.]*-->/g;
var regex2 = /<!--.*-->/g;

alert(str.match(regex1));      // null
alert(str.match(regex2));      // <!--My -- comment test--> 

The second regex regex2 works fine, outputs exactly what's needed. The first shows null. And I don't understand the difference. RegExpressions <!--[.]*--> and <!--.*--> mean the same - "after <!-- take ANY character except newline in quantity from 0 to as many as possible and finish with -->". But for the second it works and for the first does not. Why?

Update: I've read the comments and have a follow-up.

var str3 = '.. <!--Mycommenttest--> ';
var str4 = '.. <!--My comment test--> ';

var regex3 = /<!--[\w]*-->/g;
var regex4 = /<!--[\s\S]*-->/g;

alert(str3.match(regex3));        // <!--Mycommenttest-->
alert(str4.match(regex4));        // <!--My comment test-->

So it is possible to use a limited character set to match anything. Which is the right way to write the regex, then: with [] or without? I can't see the difference, since both give the right output.

Comments (4)

小猫一只 2025-01-09 06:01:18

The dot (.) does not mean "anything" inside a character class. Why would you need a character class to match anything?

书间行客 2025-01-09 06:01:18

Character class shorthands like \w, \d and \s mean exactly the same inside character classes as outside them, but metacharacters like . typically lose their special meaning inside character classes. That's why /<!--[.]*-->/ didn't work as you expected: [.] matches only a literal period.
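To make the contrast concrete, here is a minimal sketch (the variable names are illustrative and not from the original post) comparing a dot inside a character class, a bare dot, and a shorthand inside a class:

```javascript
// Illustrative names; none of these appear in the original question.
const literalDot = /^[.]$/;    // dot inside a class: a literal period
const anyChar = /^.$/;         // bare dot: any character except a line terminator
const digitInClass = /^[\d]$/; // a shorthand keeps its meaning inside a class

const results = {
  classDotMatchesLetter: literalDot.test('a'),   // false
  classDotMatchesPeriod: literalDot.test('.'),   // true
  bareDotMatchesLetter: anyChar.test('a'),       // true
  digitShorthandInClass: digitInClass.test('7'), // true
};

console.log(results);
```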

But /<!--.*-->/ doesn't really work either, since . doesn't match newlines. In most regex flavors you would use single-line mode to let the dot match all characters including newlines, like this: /<!--.*-->/s or this: (?s)<!--.*-->. JavaScript did not support that feature until ES2018 added the s (dotAll) flag, so many people use [\s\S] instead, meaning "any whitespace character or any character that's not whitespace", in other words, any character.
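A short sketch of the newline problem, using a hypothetical two-line comment as input. The last pattern uses the s (dotAll) flag and therefore assumes an ES2018-or-later engine; the other two work everywhere:

```javascript
// Hypothetical input: a comment that spans two lines.
const multiLine = '<!--line one\nline two-->';

const dotOnly = /<!--.*-->/;           // the dot stops at the newline
const anyCharClass = /<!--[\s\S]*-->/; // whitespace or non-whitespace: every character
const dotAll = /<!--.*-->/s;           // assumes ES2018+: the dot also matches newlines

const dotOnlyMatch = multiLine.match(dotOnly);    // null
const classMatch = multiLine.match(anyCharClass); // matches the whole comment
const dotAllMatch = multiLine.match(dotAll);      // matches the whole comment

console.log(dotOnlyMatch, classMatch && classMatch[0], dotAllMatch && dotAllMatch[0]);
```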

But that's not right either, since (as Jason pointed out in his comment) it will greedily match everything from the first <!-- to the last -->, which could encompass several individual comments and all the non-comment material between them. To make it truly correct is probably not worth the effort. When using regexes to match HTML, you have to make many simplifying assumptions anyway; if you can't assume a certain level of well-formedness, you might as well give up. In this case, it should suffice to make the quantifier non-greedy:

var regex5 = /<!--[\s\S]*?-->/g;
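The difference between the greedy and non-greedy quantifier shows up as soon as the input holds more than one comment. The sample string below is made up for illustration:

```javascript
// Hypothetical input with two separate comments.
const html = '<!--first--> <p>text</p> <!--second-->';

const greedy = /<!--[\s\S]*-->/g; // runs from the first <!-- to the last -->
const lazy = /<!--[\s\S]*?-->/g;  // stops at the nearest -->

const greedyMatches = html.match(greedy); // one match covering the whole string
const lazyMatches = html.match(lazy);     // ['<!--first-->', '<!--second-->']

console.log(greedyMatches, lazyMatches);
```

On the string from the question, which contains a single comment, both patterns return the same match; the lazy form only matters once several comments appear in the input.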

风筝有风,海豚有海 2025-01-09 06:01:18

RegExpressions <!--[.]*--> and <!--.*--> mean the same

This is not correct.

The brackets [] indicate a character class, where any character in the class may be matched. [.] is the character class which contains the '.' character. Contrast this with ., which is a pre-defined character class taken to mean "any character" (except for line-terminators).

So what you're matching with <!--[.]*--> is either an empty comment or a comment consisting wholly of '.' characters. And what you're matching with <!--.*--> is either an empty comment or a comment filled with any character except linebreaks.
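As a quick check of that description, the sketch below runs the bracketed pattern against an empty comment, a comment made only of periods, and the string from the question (the variable names are illustrative):

```javascript
const dotsOnly = /<!--[.]*-->/g;

const emptyComment = '<!---->'.match(dotsOnly);  // ['<!---->']
const dotComment = '<!--...-->'.match(dotsOnly); // ['<!--...-->']
const textComment = '.. <!--My -- comment test--> '.match(dotsOnly); // null

console.log(emptyComment, dotComment, textComment);
```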

白馒头 2025-01-09 06:01:18

The first doesn't work because it doesn't mean the same thing. It matches only the period character: inside a [] set, the period is not a generic match. (If you think about it, this makes sense: why would you want to match anything inside a set of limited matching variables?)
