Why doesn't the regular expression work in "greedy" mode?

Posted 2025-01-02 06:01:18

I don't understand this behavior. I have the following example, in which I need to capture an HTML comment.

var str = '.. <!--My -- comment test--> ';

var regex1 = /<!--[.]*-->/g;
var regex2 = /<!--.*-->/g;

alert(str.match(regex1));      // null
alert(str.match(regex2));      // <!--My -- comment test-->

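The second regular expression, regex2, works fine and outputs exactly what's needed. The first one shows null, and I don't understand the difference. As I read them, regex1 and regex2 mean the same thing: "after <!--, take any character except a newline, zero or more times and as many as possible, and end with -->". But the second one works and the first one doesn't. Why?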

UPD. I've read the comments and have an update.

var str3 = '.. <!--Mycommenttest--> ';
var str4 = '.. <!--My comment test--> ';

var regex3 = /<!--[\w]*-->/g;
var regex4 = /<!--[\s\S]*-->/g;

alert(str3.match(regex3));        // <!--Mycommenttest-->
alert(str4.match(regex4));        // <!--My comment test-->

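So a character class with a limited set of characters can also be used to match the comment body. Which is the right way to write such regular expressions, then: with [] or without? I can't see the difference, since both give the right output.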



小猫一只 2025-01-09 06:01:18

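The dot (.) does not mean "anything" inside a character class. And why would you need a character class to match anything in the first place?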


书间行客 2025-01-09 06:01:18

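Character class shorthands like \w, \d and \s mean exactly the same inside character classes as outside them, but metacharacters like . typically lose their special meaning inside a character class. That's why /<!--[.]*-->/ didn't work as you expected: [.] matches only a literal dot.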
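As a quick check of that point, the sketch below compares [.] with a bare dot. The variable names are made up for illustration, and console.log stands in for the alert calls used in the question.

```javascript
// Inside a character class the dot is a literal ".", not a wildcard.
const literalDot = /^[.]*$/; // zero or more "." characters and nothing else
const anyChar = /^.*$/;      // zero or more of any character except line terminators

const dotsOnly = '...';
const text = 'My -- comment test';

console.log(literalDot.test(dotsOnly)); // true
console.log(literalDot.test(text));     // false
console.log(anyChar.test(text));        // true

// The question's first pattern therefore only matches comments made of dots.
const regex1 = /<!--[.]*-->/g;
console.log('<!--...-->'.match(regex1));                    // [ '<!--...-->' ]
console.log('.. <!--My -- comment test--> '.match(regex1)); // null
```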

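But /<!--.*-->/ doesn't really work either, since . doesn't match newlines. In most regex flavors you would use single-line mode to let the dot match all characters including newlines, either with a trailing flag, as in /<!--.*-->/s, or with the inline modifier (?s). JavaScript did not support that feature when this answer was written (the s flag, also called dotAll, was added in ES2018), so most people use [\s\S] instead, meaning "any whitespace character or any character that's not whitespace"; in other words, any character.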
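A minimal sketch of the newline issue follows, using a made-up two-line comment. The s (dotAll) flag on the last pattern is newer than this answer and assumes an ES2018-capable engine; the [\s\S] form works in older engines too.

```javascript
// A comment that spans two lines.
const multiLine = 'a <!-- first line\nsecond line --> b';

const dotPattern = /<!--.*-->/g;        // "." stops at the line break
const classPattern = /<!--[\s\S]*-->/g; // [\s\S] matches any character at all
const dotAllPattern = /<!--.*-->/gs;    // "s" flag (ES2018+) lets "." match line breaks

console.log(multiLine.match(dotPattern));    // null
console.log(multiLine.match(classPattern));  // one match covering both lines
console.log(multiLine.match(dotAllPattern)); // same match as classPattern
```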

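But that's not right either, since (as Jason pointed out in his comment) it will greedily match everything from the first <!-- to the last -->, which could encompass several individual comments and all the non-comment material between them. Making it truly correct is probably not worth the effort. When using regexes to match HTML, you have to make many simplifying assumptions anyway; if you can't assume a certain level of well-formedness, you might as well give up. In this case, it should suffice to make the quantifier non-greedy: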

var regex5 = /<!--[\s\S]*?-->/g;
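To see the difference the ? makes, here is a small sketch on a made-up string containing two comments. The names page, greedy and lazy are illustrative only; lazy is the same pattern as regex5.

```javascript
// Two separate comments with ordinary markup between them.
const page = '<!-- one --> <p>text</p> <!-- two -->';

const greedy = /<!--[\s\S]*-->/g; // runs from the first <!-- to the last -->
const lazy = /<!--[\s\S]*?-->/g;  // stops at the nearest --> (same as regex5)

console.log(page.match(greedy)); // [ '<!-- one --> <p>text</p> <!-- two -->' ]
console.log(page.match(lazy));   // [ '<!-- one -->', '<!-- two -->' ]
```

The lazy form still assumes reasonably well-formed input, as the answer notes; it is a pragmatic shortcut rather than an HTML parser.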
