How does the regular expression ‘(?<=#)[^#]+(?=#)’ work?

Posted on 2024-09-06 17:34:49

I have the following regex in a C# program, and have difficulties understanding it:

(?<=#)[^#]+(?=#)

I'll break it down to what I think I understood:

(?<=#)    a group, matching a hash. what's `?<=`?
[^#]+     one or more non-hashes (used to achieve non-greediness)
(?=#)     another group, matching a hash. what's the `?=`?

So the problem I have is the ?<= and ?< part. From reading MSDN, ?<name> is used for naming groups, but in this case the angle bracket is never closed.

I couldn't find ?= in the docs, and searching for it is really difficult, because search engines will mostly ignore those special chars.

Comments (2)

违心° 2024-09-13 17:34:49

They are called lookarounds; they let you assert whether a pattern matches at the current position, without consuming any characters. There are 4 basic lookarounds:

  • Positive lookarounds: see if we CAN match the pattern...
    • (?=pattern) - ... to the right of current position (look ahead)
    • (?<=pattern) - ... to the left of current position (look behind)
  • Negative lookarounds: see if we can NOT match the pattern...
    • (?!pattern) - ... to the right
    • (?<!pattern) - ... to the left

As an easy reminder, for a lookaround:

  • = is positive, ! is negative
  • < is look behind, otherwise it's look ahead
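
The four forms listed above can be tried side by side. This is a minimal sketch in Python's `re` module, used here only as a stand-in because it is easy to run; the question is about C#, and the assumption is that .NET's Regex treats these simple patterns the same way. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: 'a' has a '#' before it, 'b' has a '#' after it,
# and 'c' has no '#' next to it at all.
sample = "#a b# c"

# (?=#)  positive lookahead: a letter immediately followed by '#'
followed_by_hash = re.findall(r"[a-z](?=#)", sample)

# (?<=#) positive lookbehind: a letter immediately preceded by '#'
preceded_by_hash = re.findall(r"(?<=#)[a-z]", sample)

# (?!#)  negative lookahead: a letter NOT immediately followed by '#'
not_followed_by_hash = re.findall(r"[a-z](?!#)", sample)

# (?<!#) negative lookbehind: a letter NOT immediately preceded by '#'
not_preceded_by_hash = re.findall(r"(?<!#)[a-z]", sample)

print(followed_by_hash)      # ['b']
print(preceded_by_hash)      # ['a']
print(not_followed_by_hash)  # ['a', 'c']
print(not_preceded_by_hash)  # ['b', 'c']
```

In every case the match is the single letter only; the `#` is checked but never becomes part of the result.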

But why use lookarounds?

One might argue that lookarounds in the pattern above aren't necessary, and #([^#]+)# will do the job just fine (extracting the string captured by \1 to get the non-#).

Not quite. The difference is that since a lookaround doesn't match the #, it can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.

Consider the following input string:

and #one# and #two# and #three#four#

Now, #([a-z]+)# will give the following matches (as seen on rubular.com):

and #one# and #two# and #three#four#
    \___/     \___/     \_____/

Compare this with (?<=#)[a-z]+(?=#), which matches:

and #one# and #two# and #three#four#
     \_/       \_/       \___/ \__/

Unfortunately this can't be demonstrated on rubular.com, since it doesn't support lookbehind. However, it does support lookahead, so we can do something similar with #([a-z]+)(?=#), which matches (as seen on rubular.com):

and #one# and #two# and #three#four#
    \__/      \__/      \____/\___/
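
The three diagrams above can be reproduced in code. This is a sketch in Python's `re` module rather than C# (an assumption made for convenience; these particular patterns should behave the same in .NET), and the variable names are illustrative only.

```python
import re

text = "and #one# and #two# and #three#four#"

# Both '#' characters are consumed, so the '#' that closes "#three#"
# is no longer available to open "#four#".
consumed = [m.group(0) for m in re.finditer(r"#([a-z]+)#", text)]

# Neither '#' is consumed: the same '#' can end one match and
# start the next.
lookarounds = re.findall(r"(?<=#)[a-z]+(?=#)", text)

# Only the leading '#' is consumed; the trailing one is just checked.
lookahead_only = [m.group(0) for m in re.finditer(r"#([a-z]+)(?=#)", text)]

print(consumed)        # ['#one#', '#two#', '#three#']
print(lookarounds)     # ['one', 'two', 'three', 'four']
print(lookahead_only)  # ['#one', '#two', '#three', '#four']
```

`finditer` with `group(0)` is used for the patterns that contain a capture group, so the lists show the full matched text rather than only the captured part.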

暮色兮凉城 2024-09-13 17:34:49

As another poster mentioned, these are lookarounds, special constructs for changing what gets matched and when. This says:

(?<=#)    assert, without consuming it, that a `#`
            comes immediately before the next expression

[^#]+     one or more characters that are not `#`, and

(?=#)     assert, without consuming it, that a `#`
            comes immediately after the previous expression

So this will match all the characters in between two #s.
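
A quick way to check this reading is to run the pattern from the question. The sketch below uses Python's `re` as a stand-in for the C# program (an assumption; the input strings are invented for illustration). Note that "between two #s" means between any two neighbouring hashes, so text that sits between the closing `#` of one pair and the opening `#` of the next is matched as well.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"(?<=#)[^#]+(?=#)")

# Two fields sharing the middle '#'.
simple = pattern.findall("#alpha#beta#")

# The " and " between '#one#' and '#two#' also lies between two hashes.
mixed = pattern.findall("and #one# and #two#")

print(simple)  # ['alpha', 'beta']
print(mixed)   # ['one', ' and ', 'two']
```

If only the text inside each `#...#` pair is wanted, a consuming pattern such as `#([^#]+)#` with a capture group may be the better fit, at the cost of not reusing a shared `#`.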

Lookaheads and lookbehinds are very useful in many cases. Consider, for example, the rule "match all bs not followed by an a." Your first attempt might be something like b[^a], but that's not right: this will also match the bu in bus or the bo in boy, but you only wanted the b. And it won't match the b in cab, even though that's not followed by an a, because there are no more characters to match.

To do that correctly, you need a lookahead: b(?!a). This says "match a b only if it is not immediately followed by an a, and don't make whatever follows part of the match". Thus it'll match just the b in bolo, which is what you want; likewise it'll match the b in cab.
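
The difference between the two attempts shows up when both are run over a few words. This is a small sketch in Python's `re` (a stand-in for C#, on the assumption that both engines agree on these patterns); the word list is made up, with "bat" added as a case where the b is followed by an a.

```python
import re

words = ["bus", "boy", "cab", "bolo", "bat"]

# First attempt: [^a] must consume a real character after the 'b'.
naive = {w: re.findall(r"b[^a]", w) for w in words}

# Lookahead: only checks what follows the 'b', consuming nothing.
lookahead = {w: re.findall(r"b(?!a)", w) for w in words}

for w in words:
    print(w, naive[w], lookahead[w])
# bus ['bu'] ['b']
# boy ['bo'] ['b']
# cab [] ['b']
# bolo ['bo'] ['b']
# bat [] []
```

The naive pattern drags the following character into the match and misses the b at the end of cab, while the lookahead version returns just the b in each case and still rejects bat.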
