>>> import re
>>> d = "Batman,Superman"
>>> m = re.search(r"(?<!Bat)\w+", d)
>>> m.group(0)
'Batman'
Why isn't group(0) matching Superman? This lookaround tutorial says:
(?<!a)b matches a "b" that is not
preceded by an "a", using negative
lookbehind
Batman isn't directly preceded by Bat, so it matches first. In fact, neither is Superman: there's a comma in between in your string, which is enough to let that RE match there, but Superman isn't what gets matched because a match is possible earlier in the string.
Maybe this will explain better: if the string was Batman and you started trying to match from the m, the RE would not match until the character after it (giving a match of an), because the position just before the m is the only place in the string that is directly preceded by Bat.
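To make that last point concrete, here is a minimal sketch. It assumes a compiled pattern and the `pos` argument of `Pattern.search` to start scanning at the m; the variable names are only for illustration.

```python
import re

# Same lookbehind pattern as in the question.
pattern = re.compile(r"(?<!Bat)\w+")

# Start scanning at index 3, i.e. just before the "m" of "Batman".
# The lookbehind can still see the characters before the start offset.
m = pattern.search("Batman", 3)

print(m.group(0), m.span())  # an (4, 6)
```

The attempt at index 3 is rejected because the three characters before it are Bat; the next position has atm behind it, so the match starts there.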
At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if it holds, the engine continues trying to make a match. As soon as the engine can match your pattern, it returns that match.
At position 0 of your string (i.e. just before the B in Batman), the assertion succeeds, as Bat is not present before the current position. Thus \w+ can match the entire word Batman (remember, quantifiers such as + are greedy by default, i.e. they match as much as possible). See this page for more information on engine internals.
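The cursor picture can be checked directly. This is only a sketch: it tests the bare lookbehind at every position with `Pattern.match` and a start offset, then runs the full pattern. The names `lookbehind`, `failing` and `first` are made up for the example.

```python
import re

d = "Batman,Superman"

# The bare assertion, with nothing after it: it matches the empty string
# wherever the text just before the position is not "Bat".
lookbehind = re.compile(r"(?<!Bat)")

# Try every cursor position from 0 to len(d) and keep those where the assertion fails.
failing = [i for i in range(len(d) + 1) if lookbehind.match(d, i) is None]
print(failing)  # [3]

# The full pattern succeeds at the first position it can, which is 0,
# and the greedy \w+ then runs to the end of the word.
first = re.search(r"(?<!Bat)\w+", d)
print(first.span(), first.group(0))  # (0, 6) Batman
```

So the lookbehind only ever blocks a match that would start at the m of Batman; it does nothing to stop a match starting at the B.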
To achieve what you wanted, you could instead use something like:
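The pattern that followed this sentence did not survive in this copy. Going by the description that comes next (a word boundary, then word characters that must not start with Bat, checked with a lookahead), it was presumably something like \b(?!Bat)\w+. Treat the following as a reconstruction under that assumption, not the original code:

```python
import re

d = "Batman,Superman"

# Assumed reconstruction of the missing pattern: word boundary,
# negative lookahead for "Bat", then one or more word characters.
m = re.search(r"\b(?!Bat)\w+", d)

print(m.group(0))  # Superman
```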
In this pattern, the engine matches a word boundary (\b) [1], followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because a lookbehind here would have the same problem as your original pattern: it would look before the position directly following the word boundary, and since it's already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.
[1] Note that a word boundary matches between \w and \W (i.e. between [A-Za-z0-9_] and any other character); it also matches at the start or end of the string when the adjacent character is a word character. If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.
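For comparison, a small sketch of the two variants side by side. Both patterns are written out here as assumptions based on the description above (word boundary plus a negative assertion on Bat), and the names `behind` and `ahead` are hypothetical.

```python
import re

d = "Batman,Superman"

# Hypothetical lookbehind variant, for comparison only: at a word start
# there is never a "Bat" directly behind the cursor, so it cannot filter.
behind = re.search(r"\b(?<!Bat)\w+", d)

# Lookahead variant: inspects the characters the word itself starts with.
ahead = re.search(r"\b(?!Bat)\w+", d)

print(behind.group(0), ahead.group(0))  # Batman Superman
```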
From the manual:
http://docs.python.org/library/re.html#regular-expression-syntax
You're looking for the first run of one or more word characters (\w+) that is not preceded by 'Bat'. Batman is the first such match. (Note that a negative lookbehind assertion also succeeds at the start of a string, where nothing precedes the current position.)
To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w+ greedily matches anything, including 'Batman'. As in:
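The example that followed is missing from this copy. One way to read the suggestion, assuming the literal man is placed right after the lookbehind, is sketched here; it is not the original snippet.

```python
import re

d = "Batman,Superman"

# Assumed illustration: require the literal "man" right after the lookbehind.
# The "man" in "Batman" is rejected because "Bat" sits directly before it.
m = re.search(r"(?<!Bat)man", d)

print(m.group(0), m.span())  # man (12, 15)
```

The span shows that the match comes from the end of Superman, not from Batman.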