为什么 Java 正则表达式引擎会在 + 上抛出 StringIndexOutOfBoundsException?重复?

发布于 2024-09-19 08:27:22 字数 1556 浏览 2 评论 0 原文

我编写了一个正则表达式模式来查找斐波那契数(不管为什么,我就是这么做的)。 好(参见 ideone.com):

    String FIBONACCI = 
        "(?x) .{0,2} | (?: (?=(\\2?)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)++ . ";

    for (int n = 0; n < 1000; n++) {
        String s = new String(new char[n]);
        if (s.matches(FIBONACCI)) {
            System.out.print(n + " ");
        }
    } // 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

它按预期工作得非常 正则表达式.info/possessive.html" rel="noreferrer">占有重复(即主“循环”上的++)至关重要,因为你不想回溯用这个匹配算法。然而,使重复可回溯(即只是主“循环”上的 +)不会导致不匹配,而是导致运行时异常! (如 ideone.com 上所见):

Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
    String index out of range: -1

    at java.lang.String.charAt(String.java:686)
    at java.lang.Character.codePointAt(Character.java:2335)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3994)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3966)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3916)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Matcher.match(Matcher.java:1127)
    at java.util.regex.Matcher.matches(Matcher.java:502)
    at java.util.regex.Pattern.matches(Pattern.java:930)
    at java.lang.String.matches(String.java:2090)

有人可以解释一下这里发生了什么吗?这是 Java 正则表达式引擎中的错误吗?

I've written a regex pattern to find Fibonacci numbers (it doesn't matter why, I just did). It works wonderfully as expected (see on ideone.com):

    String FIBONACCI = 
        "(?x) .{0,2} | (?: (?=(\\2?)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)++ . ";

    for (int n = 0; n < 1000; n++) {
        String s = new String(new char[n]);
        if (s.matches(FIBONACCI)) {
            System.out.print(n + " ");
        }
    } // 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

A possessive repetition (i.e. ++ on the main "loop") is crucial, because you don't want backtracking with this matching algorithm. However, making the repetition backtrackable (i.e. just + on the main "loop") results not in mismatches, but rather a runtime exception!!! (as seen on ideone.com):

Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
    String index out of range: -1

    at java.lang.String.charAt(String.java:686)
    at java.lang.Character.codePointAt(Character.java:2335)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3994)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3966)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3916)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Matcher.match(Matcher.java:1127)
    at java.util.regex.Matcher.matches(Matcher.java:502)
    at java.util.regex.Pattern.matches(Pattern.java:930)
    at java.lang.String.matches(String.java:2090)

Can someone explain what happened here? Is this a bug in the Java regex engine?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

征棹 2024-09-26 08:27:22

错误 ID 6984178

有许多与引擎抛出 StringIndexOutOfBoundsException 相关的错误 (参见:搜索结果。这一点已被报告并在内部被接受为 Bug ID 6984178(可能需要一段时间才能显示在外部数据库中)。

这是一个更简单的模式,可以重现该错误( 另请参阅 ideone.com):

System.out.println(
   "abaab".matches("(?x) (?: (?=(a+)) \\1 b )* x")
); // StringIndexOutOfBounds: -1

请注意使用 *? *+ 只是按预期返回 false

看起来问题是由于在前瞻中存在对捕获组的引用时尝试回溯贪婪重复而触发的:越界。索引是第一个和第二个a+ 之间的长度差(例如"aabaaaaab" 得到-3)。

人们可能必须调试 java.util.regex.Pattern源代码来查明错误的确切性质。


探索斐波那契模式

在 Java 引擎上,使用贪婪回溯 +

下面是一个更详细的代码片段,用于显示引擎在这种模式上的疯狂程度:(

String FIBONACCI = 
    "(?x) .{0,2} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+ . ";

for (int n = 0; n < 1000; n++) {
    String s = new String(new char[n]);
    try {
        if (s.matches(FIBONACCI)) {
            System.out.printf("%n%s", n);
        }
    } catch (StringIndexOutOfBoundsException e) {
        String index = e.getMessage().replace("String index out of range: ", "");
        System.out.printf(" <%s:%s>", n, index);
    }
}

稍作编辑的)输出是 (如 ideone.com 上所示):

0 1 2 3 <5:-1>
6 <7:-1> ... <12:-1> <13:-3>
14 <15:-3> ... <33:-3> <34:-8>
35 <36:-8> ... <88:-8> <89:-21>
90 <91:-21> ... <232:-21> <233:-55>
234 <235:-55> ... <609:-55> <610:-144>
611 <612:-144> ...

因此引擎尝试以某种方式访问​​ -1、-3、-8、-21 处的字符串索引、-55、-144 等。请注意,这些是所有其他斐波那契数,但为负数。另请注意,除了前几个数字之外,其余匹配项(6、14、35...)不是斐波那契数。


在 .NET 引擎上,使用贪婪回溯 +

虽然该模式最初是考虑到所有格量词的必要性而编写的,但实际上回溯重复仍然会产生正确的答案(假设引擎不是像 Java 那样有缺陷)。下面是 .NET 引擎上的 C# 实现(另请参阅 ideone.com):

Regex r = new Regex(
  @"(?x) ^.{0,1}$ | ^(?: (?=(\2?)) (?=(\2\3|^.)) (?=(\1)) \2)+ . $ "
);

for (int n = 0; n < 1000; n++) {
  if (r.IsMatch("".PadLeft(n))) {
    Console.Write("{0} ", n);
  }
}
// 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

如您所见,即使有回溯 +“循环”,输出也是正确的。事实上,正是因为它是一个回溯循环,所以特殊情况可以仅限于 .{0,1} 而不是 .{0,2}


在 Java 引擎上,不情愿地回溯 +?

这在 Java 中按预期工作。另外,因为不情愿,我们还可以将特殊情况限制为 .{0,1} (另请参阅 ideone.com):

String FIBONACCI = 
        "(?x) .{0,1} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+? . ";

关于算法 公式

模式利用了第二恒等式斐波那契数列

alt text

这可以通过归纳法来证明。

模式

让我们使用这个版本的模式(适用于 Java,锚定后也适用于 C#):

                                                     summation 
free-space!                                            "loop"
    ↓                                                     ↓
   (?x) .{0,1} | (?: (?=(\2|^)) (?=(\2\3|^.)) (?=(\1)) \2)+? .
        \____/   \___________________________________/  ↑    ↑  
      base case      inductive case                    +Fi  +1
     for n = 0,1
                     (assertions don't "count" toward sum)!
                        $1 := $2    (or initialized with 0)
                        $2 := $2+$3 (or initialized with 1)
                        $3 := $1    (a temporary variable for the "swap")

斐波那契数列的生成非常简单,基于 [$1, $2] := [$2, $2+ $1] 转换。由于断言是按顺序执行的,因此引入了临时变量(就像单赋值“伪代码”版本一样)。

Bug ID 6984178

There are many bugs related to the engine throwing StringIndexOutOfBoundsException (see: search results. This one in particular has been reported and internally accepted as Bug ID 6984178 (it may take a while for this to show up in the external database).

Here's a much simpler pattern that reproduces the bug (see also on ideone.com):

System.out.println(
   "abaab".matches("(?x) (?: (?=(a+)) \\1 b )* x")
); // StringIndexOutOfBounds: -1

Note that using *? or *+ simply returns false as expected.

It looks like the problem is triggered by the attempt to backtrack a greedy repetition when there's a reference to a capturing group inside a lookahead: the out of bounds index is the difference in length between the first and the second a+ (e.g. "aabaaaaab" gets -3).

One would likely have to debug the java.util.regex.Pattern source code to pinpoint the exact nature of the bug.


Exploring the Fibonacci pattern

On the Java engine, with greedy backtracking +

Here's a more verbose snippet to show just how bonkers the engine gets on this pattern:

String FIBONACCI = 
    "(?x) .{0,2} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+ . ";

for (int n = 0; n < 1000; n++) {
    String s = new String(new char[n]);
    try {
        if (s.matches(FIBONACCI)) {
            System.out.printf("%n%s", n);
        }
    } catch (StringIndexOutOfBoundsException e) {
        String index = e.getMessage().replace("String index out of range: ", "");
        System.out.printf(" <%s:%s>", n, index);
    }
}

The (slightly edited) output is (as seen on ideone.com):

0 1 2 3 <5:-1>
6 <7:-1> ... <12:-1> <13:-3>
14 <15:-3> ... <33:-3> <34:-8>
35 <36:-8> ... <88:-8> <89:-21>
90 <91:-21> ... <232:-21> <233:-55>
234 <235:-55> ... <609:-55> <610:-144>
611 <612:-144> ...

So somehow the engine tries to access string indices at -1, -3, -8, -21, -55, -144, etc. Note that these are every other Fibonacci number, but negative. Note also that beyond the first few numbers, the rest of the matches (6, 14, 35, ...) are NOT Fibonacci numbers.


On the .NET engine, with greedy backtracking +

While the pattern was originally written with the necessity for possessive quantifier in mind, in fact a backtracking repetition will still yield the correct answer (assuming the engine isn't buggy like Java's). Here's a C# implementation on the .NET engine (see also on ideone.com):

Regex r = new Regex(
  @"(?x) ^.{0,1}$ | ^(?: (?=(\2?)) (?=(\2\3|^.)) (?=(\1)) \2)+ . $ "
);

for (int n = 0; n < 1000; n++) {
  if (r.IsMatch("".PadLeft(n))) {
    Console.Write("{0} ", n);
  }
}
// 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

As you can see, the output is correct, even with a backtracking + "loop". In fact, precisely because it's a backtracking loop, the special case can be limited to just .{0,1} instead of .{0,2}.


On the Java engine, with reluctant backtracking +?

This works in Java as expected. Also, because it's reluctant, we can also limit the special case to just .{0,1} (see also on ideone.com):

String FIBONACCI = 
        "(?x) .{0,1} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+? . ";

On the algorithm

The formula

The pattern exploits the Second Identity of Fibonacci Numbers:

alt text

This can be proven by induction.

The pattern

Let's use this version of the pattern (which works in Java, and when anchored, also works in C#):

                                                     summation 
free-space!                                            "loop"
    ↓                                                     ↓
   (?x) .{0,1} | (?: (?=(\2|^)) (?=(\2\3|^.)) (?=(\1)) \2)+? .
        \____/   \___________________________________/  ↑    ↑  
      base case      inductive case                    +Fi  +1
     for n = 0,1
                     (assertions don't "count" toward sum)!
                        $1 := $2    (or initialized with 0)
                        $2 := $2+$3 (or initialized with 1)
                        $3 := $1    (a temporary variable for the "swap")

The Fibonacci sequence generation is straightforward, based on the [$1, $2] := [$2, $2+$1] transition. Since the assertions are performed sequentially, a temporary variable is introduced (just like the single-assignment "pseudocode" version).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文