Python regex negative-lookbehind

Can you explain why this regular expression doesn't work?

Posted on 2024-11-15 23:22:08

>>> import re
>>> d = "Batman,Superman"
>>> m = re.search(r"(?<!Bat)\w+", d)
>>> m.group(0)
'Batman'

Why isn't group(0) matching Superman? This lookaround tutorial says:

(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind

九八野马 2024-11-22 23:22:08

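Batman isn't directly preceded by Bat, so it matches first. In fact, Superman isn't either; the comma between the two words in your string would be enough to let the RE match there, but that match is never reached because an earlier match in the string is possible.

Maybe this explains it better: if the string were Batman and you started trying to match from the m, the RE wouldn't match until the character after it (giving a match of an), because the position at the m is the only place in the string that is directly preceded by Bat.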
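
The "start from the m" scenario can be tried directly. This is a minimal sketch, assuming Python 3's standard re module; it relies on the pos argument of a compiled pattern's search(), which moves the starting point while still letting the lookbehind see the text before it. The variable names are only illustrative.

```python
import re

pattern = re.compile(r"(?<!Bat)\w+")

# Scanning from the start: nothing precedes position 0, so the
# negative lookbehind succeeds there and \w+ takes the whole word.
first = pattern.search("Batman,Superman")
print(first.group(0), first.span())   # Batman (0, 6)

# Scanning "Batman" from index 3 (the "m"): the three characters before
# that position are "Bat", so the lookbehind fails there. The engine
# moves one position right, where the preceding text is "atm".
later = pattern.search("Batman", 3)
print(later.group(0), later.span())   # an (4, 6)
```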

夜深人未静 2024-11-22 23:22:08

At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.

At position 0 of your string (i.e. just before the B in Batman), the assertion succeeds, because Bat is not present before the current position. \w+ can therefore match the entire word Batman (remember that quantifiers such as + are greedy by default, i.e. they match as much as possible).
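
One way to see the cursor model in action is to probe every position with the lookbehind alone. The sketch below assumes Python 3.7 or later (where finditer reports an empty match at each position that qualifies); the names ok_positions and rejected are made up for the illustration.

```python
import re

d = "Batman,Superman"

# Zero-width probe: list every position where "not preceded by Bat" holds.
ok_positions = [m.start() for m in re.finditer(r"(?<!Bat)", d)]
print(ok_positions)

# Only index 3 (right after "Bat") is rejected; index 0 is accepted,
# so the search never needs to look further right than the first word.
rejected = [i for i in range(len(d) + 1) if i not in ok_positions]
print(rejected)   # [3]

# \w+ is greedy, so from index 0 it consumes all six word characters.
print(re.search(r"(?<!Bat)\w+", d).span())   # (0, 6)
```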

See this page for more information on engine internals.

To achieve what you wanted, you could instead use something like:

\b(?!Bat)\w+

In this pattern, the engine matches a word boundary (\b)¹, followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because a lookbehind here would have the same problem as your original pattern: it would look at the text before the position directly following the word boundary, and since it has already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.

¹ Note that a word boundary matches between \w and \W (i.e. between [A-Za-z0-9_] and any other character); it also matches at the start or end of the string when a word character is adjacent. If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.
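
A quick check of the \b(?!Bat)\w+ pattern, as a sketch assuming Python 3's re module. The second input string, with Batgirl and Robin, is a hypothetical extension added only to show the filter working across several words.

```python
import re

d = "Batman,Superman"
pattern = re.compile(r"\b(?!Bat)\w+")

print(pattern.search(d).group(0))   # Superman

# Hypothetical extra input to show the filter across several words.
words = "Batman,Superman,Batgirl,Robin"
print(pattern.findall(words))       # ['Superman', 'Robin']
```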

软糖 2024-11-22 23:22:08

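From the manual:

Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

http://docs.python.org/library/re.html#regular-expression-syntax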
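
To make the manual's remark concrete, here is a small sketch (Python 3, standard re module) that sets the negative lookbehind next to a positive one. The positive form is not part of the question; it is included only as a contrast.

```python
import re

d = "Batman,Superman"

# Negative lookbehind: succeeds at index 0 because no text precedes it.
neg = re.search(r"(?<!Bat)\w+", d)
print(neg.start(), neg.group(0))   # 0 Batman

# Positive lookbehind, for contrast: needs "Bat" immediately before,
# so it cannot match at index 0 and first succeeds at index 3.
pos = re.search(r"(?<=Bat)\w+", d)
print(pos.start(), pos.group(0))   # 3 man
```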

夏见 2024-11-22 23:22:08

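You're looking for the first run of one or more alphanumeric characters (\w+) that is not preceded by "Bat". Batman is the first such match. (Note that a negative lookbehind assertion can match at the beginning of the string.)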

呆萌少年 2024-11-22 23:22:08

To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w+ greedily matches any run of word characters, including 'Batman'. As in:

>>> import re
>>> re.search(r"\w+(?<!Bat)man", "Batman,Superman").group(0)
'Superman'
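
The two fixes suggested in this thread are not interchangeable, which a short comparison may help show. This is a sketch assuming Python 3's re module; "Batwoman" is a made-up test word, and the names lookbehind and lookahead are only labels for the two patterns.

```python
import re

lookbehind = re.compile(r"\w+(?<!Bat)man")
lookahead = re.compile(r"\b(?!Bat)\w+")

# On the original input both give the same answer.
d = "Batman,Superman"
print(lookbehind.search(d).group(0))   # Superman
print(lookahead.search(d).group(0))    # Superman

# "Batwoman" is a made-up test word: "man" is preceded by "two", not "Bat",
# so the lookbehind version accepts it, while the lookahead version
# rejects any word that begins with "Bat".
print(lookbehind.search("Batwoman").group(0))   # Batwoman
print(lookahead.search("Batwoman"))             # None
```

Which behaviour is wanted depends on whether the goal is "words not starting with Bat" or "words ending in man where man is not right after Bat".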
