Regex greedy matching not working as expected
I have a very basic regular expression, and I can't figure out why it isn't working, so the question has two parts: why does my current version not work, and what is the correct expression?
The rules are pretty simple:
- Must have a minimum of 3 characters.
- If a % character is the first character, must have a minimum of 4 characters.
So the following cases should work out as follows:
- AB - fail
- ABC - pass
- ABCDEFG - pass
- % - fail
- %AB - fail
- %ABC - pass
- %ABCDEFG - pass
- %%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
- ^ - Start of string
- %? - Greedy check for 0 or 1 % character
- \S{3} - 3 other characters that are not white space
The problem is that the %? for some reason is not doing a greedy check. It's not eating the % character if it exists, so the '%AB' case is passing, which I think should be failing. Why is the %? not eating the % character?
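To make the symptom reproducible, here is a minimal sketch in JavaScript (the flavor the question mentions) that runs the pattern above against the listed cases. The variable names are illustrative only; the pattern and the expected results come from the question.

```javascript
// The pattern from the question, unchanged.
const original = /^%?\S{3}/;

// Test strings and expected results, copied from the question.
const cases = [
  ["AB", false],
  ["ABC", true],
  ["ABCDEFG", true],
  ["%", false],
  ["%AB", false],
  ["%ABC", true],
  ["%ABCDEFG", true],
  ["%%AB", true],
];

// Collect every input whose actual result differs from the expected one.
const mismatches = cases
  .filter(([input, expected]) => original.test(input) !== expected)
  .map(([input]) => input);

console.log(mismatches); // [ '%AB' ]

// What the engine matched for "%AB": after backtracking, %? matches
// nothing and \S{3} consumes all three characters, "%" included.
console.log("%AB".match(original)[0]); // %AB
```

Only '%AB' disagrees with the expected results, and the matched text shows that the % ended up being consumed by \S{3} rather than by %?.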
Someone please show me the light :)
Edit: The answer I used was Dav's below: ^(%\S{3}|[^%\s]\S{2})
It ended up being a two-part answer, though: Alan's really made me understand why. I didn't use his version, ^(?>%?)\S{3}, because it worked, but not in the JavaScript implementation. Both are great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex: it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
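Since the question notes that the (?>...) syntax does not work in the JavaScript implementation, here is a sketch of a common workaround that gets the same no-backtracking behavior there. This is a swapped-in technique, not the atomic group itself: a lookahead captures the optional %, and a backreference then consumes exactly what was captured. It relies on ECMAScript not backtracking into a lookahead once it has succeeded. The names emulatedAtomic, inputs and atomicResults are illustrative.

```javascript
// Emulated atomic group: the lookahead captures the optional "%" without
// consuming it, and \1 then consumes exactly what was captured.
// Once the lookahead has succeeded the engine does not re-enter it,
// so the captured "%" cannot be handed back to \S{3}.
const emulatedAtomic = /^(?=(%?))\1\S{3}/;

// The test strings from the question.
const inputs = ["AB", "ABC", "ABCDEFG", "%", "%AB", "%ABC", "%ABCDEFG", "%%AB"];

// Map each input to whether the pattern matches it.
const atomicResults = Object.fromEntries(
  inputs.map((input) => [input, emulatedAtomic.test(input)])
);

console.log(atomicResults["%AB"]);  // false
console.log(atomicResults["%ABC"]); // true
console.log(atomicResults["%%AB"]); // true
```

With this pattern '%AB' fails, because the % is claimed before \S{3} runs and only two characters remain.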
Regex will always try to match the whole pattern if it can. "Greedy" doesn't mean "will always grab the character if it exists"; it means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
This matches either a % followed by 3 non-whitespace characters, or a non-%, non-whitespace character followed by 2 more.
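As a quick check, the sketch below runs the alternation pattern quoted in the question's edit, ^(%\S{3}|[^%\s]\S{2}), against the question's cases in JavaScript. The names davPattern, expectedResults and davFailures are illustrative only.

```javascript
// The alternation pattern: either "%" plus three non-whitespace characters,
// or a first character that is neither "%" nor whitespace, plus two more.
const davPattern = /^(%\S{3}|[^%\s]\S{2})/;

// Expected results, copied from the question.
const expectedResults = {
  "AB": false,
  "ABC": true,
  "ABCDEFG": true,
  "%": false,
  "%AB": false,
  "%ABC": true,
  "%ABCDEFG": true,
  "%%AB": true,
};

// Keep only the inputs where the pattern disagrees with the expectation.
const davFailures = Object.entries(expectedResults)
  .filter(([input, expected]) => davPattern.test(input) !== expected)
  .map(([input]) => input);

console.log(davFailures.length); // 0
```

Because the two branches cannot both start on the same first character, there is nothing for the engine to backtrack into, which is why this works without atomic groups.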
I always love to look at RE questions to see how much time people spend on them to "Save time"
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
with the regex option "^ and $ match at line breaks" on.
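To illustrate what that option does, here is a sketch using JavaScript's m flag. It reuses Dav's pattern unchanged as a stand-in, which is an assumption: it is not the modified pattern this answer refers to. The sample text and variable names are also made up for illustration.

```javascript
// Dav's pattern with the m flag (so ^ matches after each line break)
// and the g flag (so every matching line is reported).
// Assumption: this is a stand-in, not the modified pattern from the answer.
const perLine = /^(%\S{3}|[^%\s]\S{2})/gm;

// Hypothetical multi-line input, one candidate value per line.
const text = "AB\n%ABC\n%AB\nABCDEFG";

// With the m flag, each line is tested from its own start.
const matched = text.match(perLine);
console.log(matched); // [ '%ABC', 'ABC' ]

// Without the m flag, ^ only matches at the very start of the string,
// where "AB" is too short, so nothing matches.
const withoutMultiline = text.match(/^(%\S{3}|[^%\s]\S{2})/g);
console.log(withoutMultiline); // null
```

The lines 'AB' and '%AB' are skipped, and 'ABCDEFG' contributes only its first three characters because the pattern has no end anchor.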