Why is a minimal (non-greedy) match affected by the end-of-string character "$"?

Posted on 2024-11-05 02:13:43

EDIT: Removed the original example because it provoked ancillary answers. Also fixed the title.

The question is why the presence of the "$" in the regular expression affects the greediness of the expression:

Here is a simpler example:

>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'

The "?" seems to be doing nothing. Note, however, that when the "$" is removed, the "?" is respected:

>>> m = re.search(r"a+?", str)
>>> m.group()
'a'

EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, which is not what I expected. Here is the description of the regex "+?" from the Python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."

This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

Comments (6)

想你只要分分秒秒 2024-11-12 02:13:43

Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
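
To make the "left-most wins" point concrete, here is a small sketch (the variable names are mine, not from the question). It uses the `pos` argument of a compiled pattern's `match` method to ask which start positions could produce a match on their own, then shows that `search` reports the earliest one.

```python
import re

text = "baaaaaaaa"
lazy = re.compile(r"a+?$")

# Try anchoring the pattern at every start position in turn.
# pattern.match(string, pos) only attempts a match beginning at pos.
starts_that_match = [
    pos for pos in range(len(text)) if lazy.match(text, pos) is not None
]
print(starts_that_match)   # every index from 1 to 8 could start a match

# search walks the start positions left to right and stops at the first
# success, so the lazy quantifier never gets to "choose" a later start.
found = lazy.search(text)
print(found.span(), found.group())
```

Any of the later start positions would give a shorter match, but the search never reaches them because position 1 already succeeds.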

(Answer changed after OP clarification in comments. See history for previous text.)

小嗲 2024-11-12 02:13:43

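The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add a greedy .* to the beginning of your pattern and capture the part you care about in a group; a lazy .+? prefix will not do it, because it gives up as soon as it can.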

Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
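
A quick experiment on prefixes, as a sketch (the capture groups and variable names are my additions): it compares a lazy `.+?` prefix with a greedy `.*` prefix to see which one actually pushes the captured part toward the end of the string.

```python
import re

text = "baaaaaaaa"

# Lazy prefix: .+? stops as early as it can (after "b"), so the group
# still starts at the first "a" and has to stretch to reach "$".
lazy_prefix = re.search(r".+?(a+?)$", text)
print(lazy_prefix.group(1))

# Greedy prefix: .* grabs everything, then backtracks one character at a
# time until the group can match, which leaves only the last "a" for it.
greedy_prefix = re.search(r".*(a+?)$", text)
print(greedy_prefix.group(1), greedy_prefix.span(1))
```

In this run the lazy prefix leaves all eight a's in the group, while the greedy prefix leaves only the final one.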

EDIT:

More details... In this case:

re.search(r"a+?$", "baaaaaaaa")

the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.

But in this case:

re.search(r"a+?", "baaaaaaaa")

the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.

愛上了 2024-11-12 02:13:43

The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.

Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.

With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.

On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.
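
The hand-off described here can be imitated with a toy matcher. This is only an illustrative sketch, not how the `re` module is implemented: the function name is hypothetical, it handles just the two patterns a+$ and a+?$, and it treats "$" as matching only at the very end of the text.

```python
def match_a_plus_then_end(text, start, lazy):
    """Toy backtracking matcher for a+$ (greedy) or a+?$ (lazy).

    Returns (end index or None, positions at which "$" was tested).
    """
    # Find how many consecutive "a"s are available from start.
    limit = start
    while limit < len(text) and text[limit] == "a":
        limit += 1
    if limit == start:
        return None, []                 # a+ needs at least one "a"

    # Candidate end positions, in the order the quantifier prefers them.
    if lazy:
        candidates = range(start + 1, limit + 1)   # fewest "a"s first
    else:
        candidates = range(limit, start, -1)       # most "a"s first

    tested = []
    for end in candidates:
        tested.append(end)
        if end == len(text):            # "$" succeeds only at the very end
            return end, tested
    return None, tested


text = "baaaaaaaa"
greedy_end, greedy_tests = match_a_plus_then_end(text, 1, lazy=False)
lazy_end, lazy_tests = match_a_plus_then_end(text, 1, lazy=True)
print(greedy_end, greedy_tests)   # "$" tested once
print(lazy_end, lazy_tests)       # "$" tested after every single "a"
```

Both variants end at the same position; they differ only in how many times the anchor is tested along the way.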

As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-than-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.

I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under the hood, but are required to return the same results a DFA (text-directed) regex would.
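
The ordered-alternation point is easy to see in Python. A minimal sketch (the sample patterns are mine): the same two alternatives, listed in different orders, give different matches at the same start position, which a leftmost-longest engine would not be expected to do.

```python
import re

# Alternatives are tried left to right; the first one that lets the whole
# pattern succeed wins, even if a later alternative would match more text.
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()
print(short_first, long_first)
```

With the shorter alternative listed first, the match is just "a"; swapping the order yields "ab".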

墟烟 2024-11-12 02:13:43

Answer to original question:

Why does the first search() span multiple "/"s rather than taking the shortest match?

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.

Answer to revised question:

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.

Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.
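
A short illustration of "shortest match consistent with the whole pattern succeeding", as a sketch with a made-up sample string: the same lazy subpattern matches less or more depending on what the rest of the pattern requires.

```python
import re

sample = "<a><b>"

greedy = re.search(r"<.+>", sample).group()           # takes as much as it can
lazy = re.search(r"<.+?>", sample).group()            # stops at the first ">"
lazy_anchored = re.search(r"<.+?>$", sample).group()  # must also reach "$"
print(greedy, lazy, lazy_anchored)
```

Without the anchor the lazy version stops at "<a>"; with "$" appended it is retried with one more character at a time until the whole pattern succeeds, ending up with the same text as the greedy version.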

落在眉间の轻吻 2024-11-12 02:13:43


There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.

>>> import re
>>> string = "baaa"
>>>
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>>
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>>
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'

Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.

>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'

Edit: Removed text related to old versions of the question.

往事随风而去 2024-11-12 02:13:43

Unless your question is leaving out some important information, you don't need, and shouldn't use, a regex for this task.

>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
'butshouldseethis'