这个正则表达式替换如何反转字符串?

发布于 2024-09-18 23:07:04 字数 4758 浏览 2 评论 0 原文

这是一系列教育正则表达式文章的第四部分。它显示了如何组合嵌套引用(请参阅:此正则表达式如何查找三角形数字?)在断言中“计数”(参见:我们如何将 a^nb^n 与 Java 正则表达式匹配?)可用于反转字符串。以编程方式生成的模式使用元模式抽象(请参阅:此 Java 正则表达式如何检测回文?)。在本系列中,这些技术首次用于替换而不是整个字符串匹配。

提供了完整的工作 Java 和 C# 实现。包括鼓舞人心的名言。

使用正则表达式反转字符串似乎从来都不是一个好主意,甚至也不是立即显而易见的,如果它可能的话,如果可能的话,人们可能会如何尝试这样做。

虽然这仍然不是一个好主意,但至少现在我们知道这是可能的,因为这是一种实现方法:

C# (也在 ideone.com)

using System;
using System.Text.RegularExpressions;

public class TwoDollarReversal {    
public static void Main() {
   string REVERSE = 
      @"(?sx) . grab$2"
         .Replace("grab$2",
            ForEachDotBehind(
               AssertSuffix(@"((.) \1?)")
            )
         );
   Console.WriteLine(
      Regex.Replace(
         @"nietsniE treblA --
         hguone llew ti dnatsrednu t'nod uoy ,ylpmis ti nialpxe t'nac uoy fI",

         REVERSE, "$2"
      )
   );
   // If you can't explain it simply, you don't understand it well enough
   // -- Albert Einstein
}      
// performs an assertion for each dot behind current position
static string ForEachDotBehind(string assertion) {
   return "(?<=(?:.assertion)*)".Replace("assertion", assertion);
}
// asserts that the suffix of the string matches a given pattern
static string AssertSuffix(string pattern) {
   return "(?=.*$(?<=pattern))".Replace("pattern", pattern);
}

}

Java (也在 ideone.com)

class TwoDollarReversal {

public static void main(String[] args) {
   String REVERSE =
      "(?sx) . grab$2"
         .replace("grab$2",
            forEachDotBehind(
               assertSuffix("((.) \\1?)")
            )
         );

   System.out.println(
      "taerG eht rednaxelA --\nyrt lliw ohw mih ot elbissopmi gnihton si erehT"
         .replaceAll(REVERSE, "$2")
   );
   // There is nothing impossible to him who will try
   // -- Alexander the Great"
}

static String forEachDotBehind(String assertion) {
   return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
static String assertSuffix(String pattern) {
   return "(?<=(?=^.*?pattern$).*)".replace("pattern", pattern);
}

}

C# 和 Java 版本似乎都使用相同的整体算法,仅在抽象的实现细节上有微小的变化。

显然这不是反转字符串的最佳、最直接、最有效的方法。也就是说,为了学习正则表达式;如何概念化模式;引擎如何工作来匹配它们;如何将各个部分组合在一起构建我们想要的东西;如何以可读和可维护的方式做到这一点;只是为了纯粹的学习新东西的乐趣,我们能解释一下它是如何工作的吗?


附录:备忘单!

这是所使用的基本正则表达式构造的简要描述:

  • (?sx) 是嵌入标志修饰符s 启用“单行”模式,允许 匹配任何字符(包括换行符)。 x 启用 自由间距 模式,其中未转义的空格忽略(并且 # 可用于注释)。
  • ^$ 是行的开头和结尾锚点
  • ? 作为重复说明符表示 可选 (即零或-之一)。作为例如 .*? 中的重复量词,它表示 *(即零个或多个)重复是 不情愿/非贪婪。
  • (…) 用于分组(?:...) 是非捕获组。捕获组保存它匹配的字符串;它允许后退/前进/嵌套引用(例如 \1)、替换替换(例如 $2)等。
  • (?=...) 是积极的前瞻;它向右查看以断言给定模式存在匹配。 (?<=…) 是一个积极的lookbehind;它看起来向左。

语言参考/其他资源

This is the fourth part in a series of educational regex articles. It show how the combination of nested reference (see: How does this regex find triangular numbers?) to "count" within assertions (see: How can we match a^n b^n with Java regex?) can be used to reverse a string. The programmatically generated pattern uses meta-pattern abstractions (see: How does this Java regex detect palindromes?). For the first time in the series, these techniques are used for replacement instead of whole string matching.

Complete working Java and C# implementations are provided. Inspirational quotes included.

Reversing a string using regular expressions never seemed like a good idea, nor was it even immediately obvious if it was at all possible, and if so, how one might attempt to do so.

While it's still not a good idea, at least now we know that it's possible, because here's one way to do it:

C# (also on ideone.com)

using System;
using System.Text.RegularExpressions;

public class TwoDollarReversal {    
public static void Main() {
   string REVERSE = 
      @"(?sx) . grab$2"
         .Replace("grab$2",
            ForEachDotBehind(
               AssertSuffix(@"((.) \1?)")
            )
         );
   Console.WriteLine(
      Regex.Replace(
         @"nietsniE treblA --
         hguone llew ti dnatsrednu t'nod uoy ,ylpmis ti nialpxe t'nac uoy fI",

         REVERSE, "$2"
      )
   );
   // If you can't explain it simply, you don't understand it well enough
   // -- Albert Einstein
}      
// performs an assertion for each dot behind current position
static string ForEachDotBehind(string assertion) {
   return "(?<=(?:.assertion)*)".Replace("assertion", assertion);
}
// asserts that the suffix of the string matches a given pattern
static string AssertSuffix(string pattern) {
   return "(?=.*$(?<=pattern))".Replace("pattern", pattern);
}

}

Java (also on ideone.com)

class TwoDollarReversal {

public static void main(String[] args) {
   String REVERSE =
      "(?sx) . grab$2"
         .replace("grab$2",
            forEachDotBehind(
               assertSuffix("((.) \\1?)")
            )
         );

   System.out.println(
      "taerG eht rednaxelA --\nyrt lliw ohw mih ot elbissopmi gnihton si erehT"
         .replaceAll(REVERSE, "$2")
   );
   // There is nothing impossible to him who will try
   // -- Alexander the Great"
}

static String forEachDotBehind(String assertion) {
   return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
static String assertSuffix(String pattern) {
   return "(?<=(?=^.*?pattern$).*)".replace("pattern", pattern);
}

}

Both the C# and Java versions seem to use the same overall algorithm, with minor variations only in the abstracted implementation details.

Clearly this is not the best, most straightforward, most efficient way to reverse a string. That said, in the interest of learning about regex; how to conceptualize patterns; how the engine works to match them; how to put various parts together to build what we want; how to do so in a way that is readable and maintainable; and just for the sheer joy of learning something new, can we have an explanation of how this works?


Appendix: Cheat sheet!

This is a brief description of the basic regex constructs used:

  • (?sx) is the embedded flag modifiers. s enables the "single-line" mode, allowing the dot to match ANY character (including newlines). x enables the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
  • ^ and $ are the beginning and end-of-the-line anchors.
  • ? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
  • (…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows back/forward/nested references (e.g. \1), replacement substitution (e.g. $2), etc.
  • (?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern. (?<=…) is a positive lookbehind; it looks to the left.

Language references/additional resources

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

郁金香雨 2024-09-25 23:07:06

使用正则表达式进行反转并不像某些人想象的那么复杂 - 只需使用 2 的幂即可完成,并使用几个循环,这本质上是对应用于任意字符串的典型位反转技巧的重新哈希,因此所需的循环周期数只是字符串长度的整数 log-base-2,而不是一次一个字符:

 LANG="en_US.UTF-8" echo '1234567890123\nABCDefghi\n코요태이하픽에' | 
 LANG="en_US.UTF-8" gawk -e '

   # gawk profile, created Sun Dec  4 17:22:55 2022

   # Functions, listed alphabetically

 5  function _____(__,___,_,____,______)
    {
 5      if ((___=(_=_<_)<+___ ? +___ : length(__))<=(++_+_)) {

            return substr((__)__,___,___*=_<=___)
        }

16      do { ++____ } while ((_+=_)<___)

 5      if (___<_) { # 5

 5        ______ = (___-(_=(_^!_+_^!_)^—____))<=_^!_ \
                   ?       substr(__,___) \
                   : _____(substr(__,_+!!_),___-_)

 5            __ = substr(__,!!_,___=_)
         }
 5       ___ = "."
11       while (____--) {

11            sub("$", "\3", __)
             gsub("(^|" (___)  (")")___, "\3&\5&", __)
             gsub("\5|(^|" (___) ")\3("(___)"|$)", "",__)

11            ___=(___)___
         }
 5       return (______)__
      }

      BEGIN {
 1       OFS = "\f\r\t"
      }

 3    ($++NF = _____($!_))^_'
 1  1234567890123
    3210987654321

 2  ABCDefghi
    ihgfeDCBA

 3  코요태이하픽에
    에픽하이태요코

using regex to reverse is not as complex as some think - just do it by powers of 2, with a few loops that is essentially a rehash of the typical bit-reversal trick applied over arbitrary strings, so # of loop cycles needed is simply the integer log-base-2 of the string length, instead of one character at a time :

 LANG="en_US.UTF-8" echo '1234567890123\nABCDefghi\n코요태이하픽에' | 
 LANG="en_US.UTF-8" gawk -e '

   # gawk profile, created Sun Dec  4 17:22:55 2022

   # Functions, listed alphabetically

 5  function _____(__,___,_,____,______)
    {
 5      if ((___=(_=_<_)<+___ ? +___ : length(__))<=(++_+_)) {

            return substr((__)__,___,___*=_<=___)
        }

16      do { ++____ } while ((_+=_)<___)

 5      if (___<_) { # 5

 5        ______ = (___-(_=(_^!_+_^!_)^—____))<=_^!_ \
                   ?       substr(__,___) \
                   : _____(substr(__,_+!!_),___-_)

 5            __ = substr(__,!!_,___=_)
         }
 5       ___ = "."
11       while (____--) {

11            sub("
quot;, "\3", __)
             gsub("(^|" (___)  (")")___, "\3&\5&", __)
             gsub("\5|(^|" (___) ")\3("(___)"|$)", "",__)

11            ___=(___)___
         }
 5       return (______)__
      }

      BEGIN {
 1       OFS = "\f\r\t"
      }

 3    ($++NF = _____($!_))^_'
 1  1234567890123
    3210987654321

 2  ABCDefghi
    ihgfeDCBA

 3  코요태이하픽에
    에픽하이태요코
转角预定愛 2024-09-25 23:07:05

概述

在较高级别上,该模式匹配任何一个字符 .,但另外执行 grab$2 操作,该操作捕获匹配到的字符的反转“mate”第 2 组。此捕获是通过构建输入字符串的后缀来完成的,后缀的长度与当前位置之前的前缀长度相匹配。我们通过在一种模式上应用 assertSuffix 来实现这一点,该模式将后缀增加一个字符,并重复一次 forEachDotBehind。第 1 组捕获此后缀。在第 2 组中捕获的该后缀的第一个字符是所匹配字符的反转“mate”。

因此,用其“配对”替换每个匹配的字符具有反转字符串的效果。


工作原理:一个更简单的示例

为了更好地理解正则表达式模式的工作原理,我们首先将其应用于更简单的输入。另外,对于我们的替换模式,我们将“转储”出所有捕获的字符串,以便我们更好地了解发生了什么。这是一个 Java 版本:

System.out.println(
    "123456789"
        .replaceAll(REVERSE, "[$0; $1; $2]\n")
);

上面的打印内容(如 ideone.com 上所示):

[1; 9; 9]
[2; 89; 8]
[3; 789; 7]
[4; 6789; 6]
[5; 56789; 5]
[6; 456789; 4]
[7; 3456789; 3]
[8; 23456789; 2]
[9; 123456789; 1]

因此,例如 [3 ;第789章7]表示该点匹配3(在第0组中捕获),对应的后缀为789(第1组),其第一个字符为 7(第 2 组)。请注意,73 的“伴侣”。

                   current position after
                      the dot matched 3
                              ↓        ________
                      1  2 [3] 4  5  6 (7) 8  9
                      \______/         \______/
                       3 dots        corresponding
                       behind      suffix of length 3

请注意,角色的“伴侣”可能在其右侧或左侧。一个角色甚至可能是它自己的“伴侣”。


后缀是如何构建的:嵌套引用

负责匹配和构建不断增长的后缀的模式如下:

    ((.) \1?)
    |\_/    |
    | 2     |       "suffix := (.) + suffix
    |_______|                    or just (.) if there's no suffix"
        1

请注意,在组 1 的定义中是对其自身的引用(使用 \1),尽管它是可选的(使用 ?)。可选部分提供“基本情况”,这是组在不引用自身的情况下进行匹配的一种方式。这是必需的,因为当组尚未捕获任何内容时,尝试匹配组引用总是会失败。

一旦第 1 组捕获了某些内容,可选部分就永远不会在我们的设置中执行,因为我们上次捕获的后缀这次仍然存在,并且我们始终可以在该后缀的开头添加另一个字符 ( .)。该前置字符被捕获到组 2 中。

因此,该模式尝试将后缀增加一个点。因此,重复一次 forEachDotBehind 将产生一个后缀,其长度恰好等于当前位置之前的前缀长度。


assertSuffixforEachDotBehind 如何工作:元模式抽象

请注意,到目前为止,我们已将 assertSuffixforEachDotBehind 视为黑匣子。事实上,将这个讨论留到最后是一个故意的行为:名称和简短的文档表明他们做什么,这对于我们编写和阅读我们的 REVERSE 来说已经足够了。 > 图案!

经过仔细检查,我们发现这些抽象的 Java 和 C# 实现略有不同。这是由于两个正则表达式引擎之间的差异造成的。

.NET 正则表达式引擎允许在后视中使用完整的正则表达式,因此这些元模式在这种风格中看起来更加自然。

  • AssertSuffix(pattern) := (?=.*$(?<=pattern)),即我们使用向前查找一直到字符串末尾,然后使用嵌套向后查找将模式与后缀进行匹配。
  • ForEachDotBehind(assertion) := (?<=(?:.assertion)*),即我们简单地在后行中匹配 .*,将断言与非捕获组内的点。

由于 Java 并不正式支持无限长度后向查找(但在某些情况下它仍然可以工作),因此它的对应部分有点尴尬:

  • assertSuffix(pattern) := (?<=(?=^.*? pattern$).*),即我们使用后向查找一直到字符串的开头,然后使用嵌套的先行查找来匹配整个字符串 em>,在后缀模式前添加 .*? 来勉强匹配一些不相关的前缀。
  • forEachDotBehind(assertion) := (?<=^(?:.assertion)*?),即我们使用不情愿重复的锚定后向,即 ^.*? (同样将断言与非捕获组内的点一起标记)。

应该注意的是,虽然这些元模式的 C# 实现在 Java 中不起作用,但 Java 实现在 C# 中却可以工作(参见 ideone.com)。因此,实际上没有必要为 C# 和 Java 提供不同的实现,但 C# 实现故意利用更强大的 .NET 正则表达式引擎后向支持来更自然地表达模式。

因此,我们展示了使用元模式抽象的好处:

  • 我们可以独立开发、检查、测试、优化等这些元模式实现,或许可以利用特定风格的功能来获得额外的性能和/或可读性。
  • 一旦这些构建块被开发出来并经过充分测试,我们就可以简单地将它们用作更大模式的一部分,这使我们能够在更高层次上表达想法,以获得更可读、更可维护、更可移植的解决方案。
  • 元模式促进重用,程序化生成意味着更少的重复

虽然这个概念的特殊表现相当原始,但也可以进一步发展,开发一个更健壮的程序化模式生成框架,其中包含经过充分测试和优化的库元模式。

另请参阅


结束语

需要重申的是,在实践中使用正则表达式反转字符串不是一个好主意。它比必要的要复杂得多,而且性能也很差。

也就是说,本文表明它实际上可以完成,并且当使用元模式抽象在更高级别表示时,该解决方案实际上非常可读。作为解决方案的关键组成部分,嵌套引用再次在另一个引人入胜的示例中展示。

不太明显的是,也许这篇文章还表明了解决一开始看起来很困难(甚至“不可能”)的问题所需的决心。也许它还表明​​了对主题的更深入理解所带来的清晰思维,这是大量研究和努力工作的结果。

毫无疑问,正则表达式可能是一个令人生畏的主题,当然它并不是为了解决您的所有问题而设计的。然而,这并不是可恨的无知的借口,如果你愿意学习的话,这将是一个令人惊讶的深渊。

Overview

At a high level, the pattern matches any one character ., but additionally performs a grab$2 action, which captures the reversal "mate" of the character that was matched into group 2. This capture is done by building a suffix of the input string whose length matches the length of the prefix up to the current position. We do this by applying assertSuffix on a pattern that grows the suffix by one character, repeating this once forEachDotBehind. Group 1 captures this suffix. The first character of that suffix, captured in group 2, is the reversal "mate" for the character that was matched.

Thus, replacing each matched character with its "mate" has the effect of reversing a string.


How it works: a simpler example

To better understand how the regex pattern works, let's first apply it on a simpler input. Also, for our replacement pattern, we'll just "dump" out all the captured strings so we get a better idea of what's going on. Here's a Java version:

System.out.println(
    "123456789"
        .replaceAll(REVERSE, "[$0; $1; $2]\n")
);

The above prints (as seen on ideone.com):

[1; 9; 9]
[2; 89; 8]
[3; 789; 7]
[4; 6789; 6]
[5; 56789; 5]
[6; 456789; 4]
[7; 3456789; 3]
[8; 23456789; 2]
[9; 123456789; 1]

Thus, e.g. [3; 789; 7] means that the dot matched 3 (captured in group 0), the corresponding suffix is 789 (group 1), whose first character is 7 (group 2). Note that 7 is 3's "mate".

                   current position after
                      the dot matched 3
                              ↓        ________
                      1  2 [3] 4  5  6 (7) 8  9
                      \______/         \______/
                       3 dots        corresponding
                       behind      suffix of length 3

Note that a character's "mate" may be to its right or left. A character may even be its own "mate".


How the suffix is built: nested reference

The pattern responsible for matching and building the growing suffix is the following:

    ((.) \1?)
    |\_/    |
    | 2     |       "suffix := (.) + suffix
    |_______|                    or just (.) if there's no suffix"
        1

Note that within the definition of group 1 is a reference to itself (with \1), though it is optional (with ?). The optional part provides the "base case", a way for the group to match without the reference to itself. This is required because an attempt to match a group reference always fails when the group hasn't captured anything yet.

Once group 1 captures something, the optional part is never exercised in our setup, since the suffix that we just captured last time will still be there this time, and we can always prepend another character to the beginning of this suffix with (.). This prepended character is captured into group 2.

Thus this pattern attempts to grow the suffix by one dot. Repeating this once forEachDotBehind will therefore results in a suffix whose length is exactly the length of the prefix up to our current position.


How assertSuffix and forEachDotBehind work: meta-pattern abstractions

Note that so far we've treated assertSuffix and forEachDotBehind as blackboxes. In fact, leaving this discussion for last is a deliberate act: the names and the brief documentation suggest WHAT they do, and this was enough information for us to write and read our REVERSE pattern!

Upon closer inspection, we see that the Java and C# implementations of these abstractions slightly differ. This is due to the differences between the two regex engines.

The .NET regex engine allows full regular expression in a lookbehind, so these meta-patterns look a lot more natural in that flavor.

  • AssertSuffix(pattern) := (?=.*$(?<=pattern)), i.e. we use a lookahead to go all the way to the end of the string, then use a nested lookbehind to match the pattern against a suffix.
  • ForEachDotBehind(assertion) := (?<=(?:.assertion)*), i.e. we simply match .* in a lookbehind, tagging the assertion along with the dot inside a non-capturing group.

Since Java's doesn't officially support infinite-length lookbehind (but it works anyway under certain circumstances), its counterpart is a bit more awkward:

  • assertSuffix(pattern) := (?<=(?=^.*?pattern$).*), i.e. we use a lookbehind to go all the way to the beginning of the string, then use a nested lookahead to match the entire string, prepending the suffix pattern with .*? to reluctantly match some irrelevant prefix.
  • forEachDotBehind(assertion) := (?<=^(?:.assertion)*?), i.e. we use an anchored lookbehind with reluctant repetition, i.e. ^.*? (and likewise tagging the assertion along with the dot inside a non-capturing group).

It should be noted that while the C# implementation of these meta-patterns doesn't work in Java, the Java implementation DOES work in C# (see on ideone.com). Thus, there is no actual need to have different implementations for C# and Java, but the C# implementation deliberately took advantage of the more powerful .NET regex engine lookbehind support to express the patterns more naturally.

We have thus shown the benefits of using meta-pattern abstractions:

  • We can independently develop, examine, test, optimize, etc. these meta-patterns implementations, perhaps taking advantage of flavor-specific features for extra performance and/or readability.
  • Once these building blocks are developed and well-tested, we can simply use them as parts of a bigger pattern, which allows us to express ideas at higher levels for more readable, more maintainable, more portable solutions.
  • Meta-patterns promote reuse, and programmatic generation means there's less duplication

While this particular manifestation of the concept is rather primitive, it's also possible to take this further and develop a more robust programmatic pattern generation framework, with a library of well-tested and optimized meta-patterns.

See also


Closing thoughts

It needs to be reiterated that reversing a string with regex is NOT a good idea in practice. It's way more complicated than necessary, and the performance is quite poor.

That said, this article shows that it CAN in fact be done, and that when expressed at higher levels using meta-pattern abstractions, the solution is in fact quite readable. As a key component of the solution, the nested reference is showcased once again in what is hopefully another engaging example.

Less tangibly, perhaps the article also shows the determination required to solve a problem that may seem difficult (or even "impossible") at first. Perhaps it also shows the clarity of thought that comes with a deeper understanding of a subject matter, a result of numerous studies and hard work.

No doubt regex can be an intimidating subject, and certainly it's not designed to solve all of your problems. This is no excuse for hateful ignorance, however, and this is one surprisingly deep well of knowledge if you're willing to learn.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文