EDIT: Removed the original example because it provoked ancillary answers. Also fixed the title.
The question is why the presence of the "$" in the regular expression affects the greediness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note, however, that when the "$" is removed, the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, which is not what I expected. Here is the description of the regex "+?" from the Python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?
Matches are "ordered" by "left-most, then longest"; however, "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match only the last "a" in "baaaaa", because a match beginning at the first "a" starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)
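A quick way to see the leftmost rule in action is to try the pattern from each start position by hand. This is only a sketch: the loop over start positions imitates what re.search does conceptually, and the names pat, found and first_start are made up for the illustration.

```python
import re

s = "baaaaaaaa"
pat = re.compile(r"a+?$")

# re.search reports the leftmost position where the pattern can match.
found = pat.search(s)
print(found.span())  # (1, 9)

# Imitate the scan: try to match at each start position in turn and
# keep the first one that works.
first_start = next(i for i in range(len(s)) if pat.match(s, i))
print(first_start)  # 1

# A one-character match does exist, but only at the last position,
# which the scan never reaches because position 1 already succeeded.
print(pat.match(s, len(s) - 1).group())  # a
```

The lazy quantifier only gets a say once the start position is fixed, and by then the only way to reach the end of the string is to take every remaining a.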
The non-greedy modifier only affects where the match stops, never where it starts. If you want the a's part to start as late as possible, you will have to put a greedy ".*" in front of your pattern and capture the part you want in a group. A lazy ".+?" prefix does not help, because the first attempt that can succeed still begins at the leftmost position.
Without the "$", your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
EDIT:
More details... In the case with the "$" (the pattern "a+?$"):
the regex engine will ignore everything up until the first "a", because that's how re.search works. It will match the first "a", and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the "$". So it just keeps eating the a's one at a time and checking for "$". If it were greedy, it wouldn't check for the "$" after each "a", but only after it couldn't match any more a's.
But in the case without the "$" (the pattern "a+?"):
the regex engine will check if it has a complete match after eating the first "a" (because it's non-greedy) and succeed, because there is no "$" in this case.
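If the goal really is to grab only the last a, one possible approach is a greedy prefix plus a capture group. The sketch below assumes that approach; the group and the variable names are illustrative and not from the original post. It also shows that a lazy prefix leaves the result unchanged.

```python
import re

s = "baaaaaaaa"

# Greedy prefix: ".*" swallows as much as it can and gives back just
# enough for the rest of the pattern to succeed, so the group starts late.
greedy_prefix = re.search(r".*(a+?)$", s)
print(greedy_prefix.group(1))  # a
print(greedy_prefix.span(1))   # (8, 9)

# Lazy prefix: ".+?" stops after the "b", and "a+?" is then stretched
# to the end of the string, so the group still holds every a.
lazy_prefix = re.search(r".+?(a+?)$", s)
print(lazy_prefix.group(1))    # aaaaaaaa
```

In both cases the overall match still begins at index 0; only the capture group moves.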
The presence of the "$" in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.
Both "a+" and "a+?" are required to consume the first "a" they find. If that "a" is followed by more a's, "a+" goes ahead and consumes them too, while "a+?" is content with just the one. If there were anything more to the regex, "a+" would be willing to settle for fewer a's, and "a+?" would consume more, if that's what it took to achieve a match.
With "a+$" and "a+?$", you've added another condition: match at least one "a" followed by the end of the string. "a+" still consumes all of the a's initially, then it hands off to the anchor ("$"). That succeeds on the first try, so "a+" is not required to give back any of its a's.
On the other hand, "a+?" initially consumes just the one "a" before handing off to "$". That fails, so control is returned to "a+?", which consumes another "a" and hands off again. And so it goes, until "a+?" consumes the last "a" and "$" finally succeeds. So yes, "a+?$" does match the same number of a's as "a+$", but it does so reluctantly, not greedily.
As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-than-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under the hood, but are required to return the same results a DFA (text-directed) regex would.
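The hand-off between the quantifier and the anchor can be imitated with a few lines of plain Python. This is a toy model, not how the re module is actually implemented: the function a_plus_then_end and its return value are invented for the illustration, and it only covers the two patterns discussed here from one fixed start position.

```python
import re


def a_plus_then_end(text, start, lazy):
    """Toy model of "a+$" (lazy=False) or "a+?$" (lazy=True) tried at `start`.

    Returns (matched_text, end_checks), where end_checks counts how many
    times the "$" anchor was tested before the attempt finished.
    """
    # Length of the run of a's beginning at `start`.
    run = 0
    while start + run < len(text) and text[start + run] == "a":
        run += 1
    if run == 0:
        return None, 0

    # Lazy: try 1, 2, 3, ... a's. Greedy: try the full run first, then fewer.
    lengths = range(1, run + 1) if lazy else range(run, 0, -1)
    end_checks = 0
    for n in lengths:
        end_checks += 1                 # hand off to "$"
        if start + n == len(text):      # "$" succeeds only at the very end
            return text[start:start + n], end_checks
    return None, end_checks


s = "baaaaaaaa"
greedy_result = a_plus_then_end(s, 1, lazy=False)
lazy_result = a_plus_then_end(s, 1, lazy=True)
print(greedy_result)  # ('aaaaaaaa', 1)
print(lazy_result)    # ('aaaaaaaa', 8)

# The real engine returns the same text for both patterns.
print(re.search(r"a+$", s).group() == re.search(r"a+?$", s).group())  # True
```

Both variants end up with the same eight a's; the difference is that the greedy one tests the anchor once and the reluctant one tests it after every a.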
Answer to original question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is "$", so the previous ones need to stretch out to the end of the string.
Answer to revised question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.
Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.
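The same "shortest match that still lets the whole pattern succeed" behaviour shows up with other patterns too. The example below uses a made-up string and pattern that are not from the question, purely to illustrate the point.

```python
import re

text = "<a><b>"

# Nothing after the closing ">" is required, so the lazy group stops early.
short = re.search(r"<(.+?)>", text)
print(short.group(1))      # a

# With "$" appended, stopping at the first ">" makes the whole pattern
# fail, so the lazy group is extended until the end anchor can match.
stretched = re.search(r"<(.+?)>$", text)
print(stretched.group(1))  # a><b
```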
There are two issues going on here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.
Bottom line is that the pattern "a+?" matches one "a", period. However, "a+?$" matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the "?" to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.
Edit: Removed text related to old versions of the question.
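For reference, here is a small sketch of how group() behaves with and without explicit groups. The pattern with two groups is an assumption chosen for illustration; it is not the example the answer originally contained.

```python
import re

s = "baaaaaaaa"

# group() with no argument is the same as group(0): the entire match.
whole = re.search(r"a+?$", s)
print(whole.group() == whole.group(0))  # True
print(whole.group())                    # aaaaaaaa

# With explicit groups, the lazy part can stay short because a second,
# greedy group picks up the remaining a's before the end anchor.
m = re.search(r"(a+?)(a*)$", s)
print(m.group(1))  # a
print(m.group(2))  # aaaaaaa
print(m.group(0))  # aaaaaaaa
```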
Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.