Nested regex lookahead and lookbehind

Posted 2024-12-11 05:37:13 · 755 words · 2 views · 0 comments

I am having problems with nested positive/negative lookahead/lookbehind in regex.

Let's say that I want to replace each '*' in a string with '%', and that '\' escapes the next character. (Turning a regex into a SQL LIKE command ^^).

So the string

  • '*test*' should be changed to '%test%',
  • '\\*test\\*' -> '\\%test\\%', but
  • '\*test\*' and '\\\*test\\\*' should stay the same.
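
The rule described in these examples can be pinned down without any regex at all. The following is only a sketch of the intended behaviour, useful as a reference when checking candidate patterns; the function name `starsToPercentReference` is hypothetical and not part of the question.

```javascript
// Hypothetical reference helper: walks the string once and treats
// a backslash as escaping whatever single character follows it.
function starsToPercentReference(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '\\' && i + 1 < text.length) {
      // Escaped pair: copy the backslash and the next character unchanged.
      out += ch + text[i + 1];
      i++;
    } else if (ch === '*') {
      // Unescaped star: this is the one that becomes '%'.
      out += '%';
    } else {
      out += ch;
    }
  }
  return out;
}

// String.raw keeps the backslashes exactly as typed.
console.log(starsToPercentReference(String.raw`*test*`));        // %test%
console.log(starsToPercentReference(String.raw`\\*test\\*`));    // \\%test\\%
console.log(starsToPercentReference(String.raw`\*test\*`));      // \*test\*
console.log(starsToPercentReference(String.raw`\\\*test\\\*`));  // \\\*test\\\*
```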

I tried:

(?<!\\)(?=\\\\)*\*      but this doesn't work
(?<!\\)((?=\\\\)*\*)    ...
(?<!\\(?=\\\\)*)\*      ...
(?=(?<!\\)(?=\\\\)*)\*  ...

What is the correct regex that will match the '*'s in examples given above?

What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\*? Or, if both are fundamentally wrong, what is the difference between regexes that are structured like these?

Comments (5)

我的奇迹 2024-12-18 05:37:13

To find an unescaped character, you would look for a character that is preceded by an even number of (or zero) escape characters. This is relatively straightforward.

(?<=(?<!\\)(?:\\\\)*)\*        # this is explained in Tim Pietzcker's answer

Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:

(?=(?<!\\)(?:\\\\)*\*)(\\*)\*  # also look at ridgerunner's improved version

Replace this with the contents of group 1 and a % sign.

Explanation

(?=           # start look-ahead
  (?<!\\)     #   a position not preceded by a backslash (via look-behind)
  (?:\\\\)*   #   an even number of backslashes (don't capture them)
  \*          #   a star
)             # end look-ahead. If found,
(             # start group 1
  \\*         #   match any number of backslashes in front of the star
)             # end group 1
\*            # match the star itself

The look-ahead makes sure only even numbers of backslashes are taken into account. The backslashes still have to be matched into a group afterwards, because the look-ahead does not advance the position in the string.
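
As a rough illustration, the look-ahead pattern above can be tried in JavaScript, assuming an engine with ES2018 lookbehind support (the inner `(?<!\\)` still needs it). The names `lookaheadPattern` and `replaceViaLookahead` are made up for this sketch.

```javascript
// Assumption: an ES2018+ engine, because the pattern contains (?<!\\).
// Group 1 re-captures the backslashes that the look-ahead only inspected.
const lookaheadPattern = /(?=(?<!\\)(?:\\\\)*\*)(\\*)\*/g;

function replaceViaLookahead(text) {
  // '$1' puts the captured backslashes back, '%' replaces the star.
  return text.replace(lookaheadPattern, '$1%');
}

console.log(replaceViaLookahead(String.raw`\\*test\\*`));    // \\%test\\%
console.log(replaceViaLookahead(String.raw`\\\*test\\\*`));  // \\\*test\\\*
```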

要走干脆点 2024-12-18 05:37:13

Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:

Replace: ((?<!\\)(?:\\\\)*)\* with $1%

Here it is in the form of a commented PHP snippet:

// Replace all non-escaped asterisks with "%".
$re = '%             # Match non-escaped asterisks.
    (                # $1: Any/all preceding escaped backslashes.
      (?<!\\\\)      # At a position not preceded by a backslash,
      (?:\\\\\\\\)*  # Match zero or more escaped backslashes.
    )                # End $1: Any preceding escaped backslashes.
    \*               # Unescaped literal asterisk.
    %x';
$text = preg_replace($re, '$1%', $text);

Addendum: Non-lookaround JavaScript Solution

The above solution does require lookbehind, so it will not work in JavaScript engines that lack lookbehind support (lookbehind was only added to the language in ES2018). The following JavaScript solution does not use lookbehind:

text = text.replace(/(\\[\S\s])|\*/g,
    function(m0, m1) {
        return m1 ? m1 : '%';
    });

This solution replaces each instance of backslash-anything with itself, and each instance of * asterisk with a % percent sign.
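
To see the callback approach on the strings from the question, it can be wrapped in a small function. This is only a sketch; the name `replaceViaCallback` is hypothetical.

```javascript
// Same regex as above: alternative 1 consumes a backslash plus any one
// character, alternative 2 is a star that was not consumed that way.
function replaceViaCallback(text) {
  return text.replace(/(\\[\S\s])|\*/g, (whole, escapedPair) =>
    // Put escaped pairs back untouched; turn a bare star into '%'.
    escapedPair ? escapedPair : '%');
}

console.log(replaceViaCallback('**text**'));                 // %%text%%
console.log(replaceViaCallback(String.raw`\\*test\\*`));     // \\%test\\%
console.log(replaceViaCallback(String.raw`\*test\*`));       // \*test\*
```

Because every escaped pair is consumed as a unit, the engine never gets the chance to start a match in the middle of one, which is why no lookbehind is needed.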

Edit 2011-10-24: Fixed JavaScript version to correctly handle cases such as: **text**. (Thanks to Alan Moore for pointing out the error in the previous version.)

等数载,海棠开 2024-12-18 05:37:13

Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution:

s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;

The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.

The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.

I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
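
JavaScript has no \G, but its sticky flag `y` (ES2015) gives a similar "continue exactly where the last match ended" behaviour. The following is a sketch of that substitute technique, not the Perl solution itself; `stickyPattern` and `replaceViaSticky` are hypothetical names, and `[\s\S]` is used in place of `.` so that an escaped newline also counts as a pair.

```javascript
// Assumption: the sticky flag `y` stands in for Perl's \G here.
// With both `g` and `y`, replace() only accepts a match that starts
// exactly where the previous one ended, and stops at the first failure.
const stickyPattern = /([^*\\]*(?:\\[\s\S][^*\\]*)*)\*/gy;

function replaceViaSticky(text) {
  // Group 1 is everything before the unescaped star; keep it, swap the star.
  return text.replace(stickyPattern, '$1%');
}

console.log(replaceViaSticky(String.raw`\\*test\\*`));    // \\%test\\%
console.log(replaceViaSticky(String.raw`\\\*test\\\*`));  // \\\*test\\\*
```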

随心而道 2024-12-18 05:37:13

So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?

Search for

(?<=(?<!\\)(?:\\\\)*)\*

and replace with %.

Explanation:

(?<=       # Assert that it's possible to match before the current position...
 (?<!\\)   # (unless there are more backslashes before that)
 (?:\\\\)* # an even number of backslashes
)          # End of lookbehind
\*         # Then match an asterisk
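
For a quick check of this pattern, here is a minimal JavaScript sketch. It assumes an ES2018+ engine, where lookbehind is not restricted to fixed-width patterns; the name `replaceViaLookbehind` is hypothetical.

```javascript
// Assumption: ES2018+ lookbehind, which accepts the variable-length
// (?:\\\\)* inside (?<=...). Only the star itself is matched.
function replaceViaLookbehind(text) {
  return text.replace(/(?<=(?<!\\)(?:\\\\)*)\*/g, '%');
}

console.log(replaceViaLookbehind(String.raw`*test*`));      // %test%
console.log(replaceViaLookbehind(String.raw`\\*test\\*`));  // \\%test\\%
console.log(replaceViaLookbehind(String.raw`\*test\*`));    // \*test\*
```

Since nothing but the star is consumed, the replacement is just '%' and no capture group is needed.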

空宴 2024-12-18 05:37:13

The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:

  • Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).

  • Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).

The result is a short, simple pattern that needs only one fixed-width lookbehind and no callbacks, and it's shorter than anything else I see so far.

/(?<!\\)((?:\\.)*)\*/g

And the replacement string:

"$1%"

This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:

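// Group 1: the character before the chain (or start of string).
// Group 2: the whole chain of escaped pairs, so none are lost on replacement.
var regx = /(^|[^\\])((?:\\.)*)\*/g;
// Loop because group 1 consumes a character, which can hide an adjacent star ("**").
while (input.match(regx)) {
    input = input.replace(regx, '$1$2%');
}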
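
One detail is worth checking whenever a repeated group is referenced in the replacement string: a quantified capturing group keeps only its last iteration. The sketch below compares the two spellings on four backslashes followed by a star, assuming an ES2018+ engine for the lookbehind; the variable names are made up for illustration.

```javascript
// Four backslashes (two escaped backslashes) followed by an unescaped star.
const sample = String.raw`\\\\*`;

// (\\.)* repeats the capturing group, so $1 holds only the last pair.
const lastIterationOnly = sample.replace(/(?<!\\)(\\.)*\*/g, '$1%');

// ((?:\\.)*) captures the entire chain, so $1 restores all of it.
const wholeChain = sample.replace(/(?<!\\)((?:\\.)*)\*/g, '$1%');

console.log(lastIterationOnly);  // \\%      (two backslashes were dropped)
console.log(wholeChain);         // \\\\%
```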

There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.
