这个 PCRE 模式如何检测回文?

发布于 2024-09-24 05:24:48 字数 1771 浏览 8 评论 0原文

这个问题是一个教育演示,展示了在 PCRE 模式中如何使用前瞻、嵌套引用和条件来匹配所有回文,包括无法通过 中给出的递归模式匹配的回文。 PCRE 手册页。

检查 PHP 代码片段中的 PCRE 模式:

$palindrome = '/(?x)
^
  (?:
      (.) (?=
              .*
              (
                \1
                (?(2) \2 | )
              )
              $
          )
  )*
  .?
  \2?
$


/';

此模式似乎可以检测回文,如本测试用例所示 (另请参阅 ideone.com):

$tests = array(
  # palindromes
  '',
  'a',
  'aa',
  'aaa',
  'aba',
  'aaaa',
  'abba',
  'aaaaa',
  'abcba',
  'ababa',

  # non-palindromes
  'aab',
  'abab',
  'xyz',
);

foreach ($tests as $test) {
  echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);  
}

那么这种模式是如何工作的呢?


备注

此模式使用嵌套引用,这与 这个 Java 正则表达式如何检测回文?,但与 Java 模式不同的是,有没有lookbehind(但它确实使用条件)。

另请注意,PCRE 手册页 提供了一个递归模式来匹配某些回文:

# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

手册页警告此递归模式无法检测所有回文(请参阅:为什么此递归正则表达式仅在字符重复 2n - 1 次时才匹配?也在 ideone.com 上),但是这个问题中提出的嵌套引用/正向前瞻模式可以。

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.

Examine this PCRE pattern in PHP snippet:

$palindrome = '/(?x)
^
  (?:
      (.) (?=
              .*
              (
                \1
                (?(2) \2 | )
              )
              $
          )
  )*
  .?
  \2?
$


/';

This pattern seems to detect palindromes, as seen in this test cases (see also on ideone.com):

$tests = array(
  # palindromes
  '',
  'a',
  'aa',
  'aaa',
  'aba',
  'aaaa',
  'abba',
  'aaaaa',
  'abcba',
  'ababa',

  # non-palindromes
  'aab',
  'abab',
  'xyz',
);

foreach ($tests as $test) {
  echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);  
}

So how does this pattern work?


Notes

This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).

Also, note that the PCRE man page presents a recursive pattern to match some palindromes:

# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

昨迟人 2024-10-01 05:24:48

让我们尝试通过构建正则表达式来理解它。首先,回文必须以相反方向的相同字符序列开始和结束:

^(.)(.)(.) ... \3\2\1$

我们要重写它,使得 ... 后面仅跟随有限长度的模式,这样它我们可以将其转换为 *。通过前瞻可以实现这一点:

^(.)(?=.*\1$)
 (.)(?=.*\2\1$)
 (.)(?=.*\3\2\1$) ...

但仍然存在不常见的部分。如果我们可以“记录”之前捕获的组呢?如果可能的话,我们可以将其重写为:

^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
 (.)(?=.*(?<record>\2\k<record>)$)   # \2\1   = \2 + \1
 (.)(?=.*(?<record>\3\k<record>)$)   # \3\2\1 = \3 + \2\1
 ...

可以将其转换为

^(?: 
    (.)(?=.*(\1\2)$)
 )*

Almost good,除了 \2 (记录的捕获)最初不为空。它只是无法匹配任何东西。如果记录的捕获不存在,我们需要它匹配空。这就是条件表达式的渗透方式。

(?(2)\2|)   # matches \2 if it exist, empty otherwise.

因此我们的表达式变为

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*

Now it matches the 回文的前半部分。下半场怎么样?嗯,第一半部分匹配后,记录的捕获 \2 将包含第二半部分。所以让我们把它放在最后。

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*\2$

我们还想处理奇数长度的回文。上半场和下半场之间会有一个自由角色。

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2$

这很有效除非在一种情况下——只有 1 个字符。这又是由于 \2 没有匹配任何内容。所以

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2?$
#      ^ since \2 must be at the end in the look-ahead anyway.

Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of character in the opposite direction:

^(.)(.)(.) ... \3\2\1$

we want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:

^(.)(?=.*\1$)
 (.)(?=.*\2\1$)
 (.)(?=.*\3\2\1$) ...

but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:

^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
 (.)(?=.*(?<record>\2\k<record>)$)   # \2\1   = \2 + \1
 (.)(?=.*(?<record>\3\k<record>)$)   # \3\2\1 = \3 + \2\1
 ...

which could be converted into

^(?: 
    (.)(?=.*(\1\2)$)
 )*

Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.

(?(2)\2|)   # matches \2 if it exist, empty otherwise.

so our expression becomes

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*

Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*\2$

We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2$

This works good except in one case — when there is only 1 character. This is again due to \2 matches nothing. So

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2?$
#      ^ since \2 must be at the end in the look-ahead anyway.
℡Ms空城旧梦 2024-10-01 05:24:48

我想提出我自己的解决方案。
这是我不久前编写的一个正则表达式,用于使用 PCRE/PCRE2 解决匹配回文

^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$

示例:
https://regex101.com/r/xvZ1H0/1

I want to bring my very own solution to the table.
This is a regex that I've written a while ago to solve matching palindromes using PCRE/PCRE2

^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$

Example:
https://regex101.com/r/xvZ1H0/1

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文