Pattern.matches() 给出 StackOverflowError

发布于 2024-12-08 10:56:59 字数 957 浏览 0 评论 0原文

我正在使用 java 的 Pattern.matches 将数据块与正则表达式匹配。数据块可以是单行或多行。问题是，一旦我的数据超过 15 行（通常超过 17-18 行），我就会开始出现 stackoverflowerror。对于少于 15 行的数据，正则表达式可以正常工作。

正则表达式的格式如下：
域名->空间-> , ->空间->数量->空间-> , ->空间->数量->换行符

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";

我用来测试这个正则表达式的数据块是这个

abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456

这是代码：

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here

原文

I'm using java's Pattern.matches to match a block of data to a regex. The block of data can be a single line or multiple lines. The problem is that once my data becomes more than 15 lines (typically more than 17-18 lines), i start getting stackoverflowerror. For data less than 15 lines the regex works fine.

The Regex is of this format:
domainname -> space -> , -> space -> number -> space -> , -> space -> number -> newline

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";

The data block i use to test against this regex is this

abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456

This is the code:

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

白芷 2024-12-15 10:56:59

我无法告诉你这个错误的原因；正则表达式本身很好，不会出现灾难性的回溯或任何其他明显的错误。

也许您可以通过使用所有格量词 (< code>++ 代替 +，*+ 代替 *，{2,}+ > 代替{2,} 等）。另外，您不需要捕获组（感谢托马斯），所以我将它们更改为非捕获组：

"(?:(?:[a-zA-Z0-9][a-zA-Z0-9-]*+\\.)++([a-zA-Z]{2,}+)\\s*+,\\s*+\\d++\\s*+,\\s*+\\d++(\r?+\n)?+)++"

这不会改变正则表达式的行为（除了删除不必要的锚点，因为您使用 Pattern.matches())，但也许它有助于避免 StackOverflows。我没有安装Java SDK，所以我无法自己测试。

I can't tell you the reason for this error; the regex itself is fine and not subject to catastrophic backtracking or any other obvious error.

Perhaps you can reduce the number of backtracking positions the regex engine saves by using possessive quantifiers (++ instead of +, *+ instead of *, {2,}+ instead of {2,} etc.). Also, you don't need the capturing groups (thanks Thomas), so I've changed them into non-capturing ones:

"(?:(?:[a-zA-Z0-9][a-zA-Z0-9-]*+\\.)++([a-zA-Z]{2,}+)\\s*+,\\s*+\\d++\\s*+,\\s*+\\d++(\r?+\n)?+)++"

This won't change the behaviour of the regex (except for the removal of the unnecessary anchors since you're using Pattern.matches()), but perhaps it helps avoid StackOverflows. I don't have a Java SDK installed, so I can't test it myself, though.

回复收藏 0 原文

橙味迷妹 2024-12-15 10:56:59

您可以尝试使用原子组 ((?>expression)) 来防止回溯：

这是一个使用正则表达式进行 1000 行块失败的测试，但现在成功了（需要一段时间，因此我仅使用 ~~5000~~ 20000 进行了测试:) )：

String regex = "(?>(?>[a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+(?>[a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(?>\\r?\\n)?)+";

StringBuilder input = new StringBuilder();

for( int i = 0; i < 1000000; ++i) {
  input.append("abc.com, 123, 456\n");
}

Pattern p = Pattern.compile( regex );
Matcher m = p.matcher( input );

System.out.println(m.matches());

所以毕竟，它可能仍然是一个回溯问题。

更新：只是让该测试运行 20000 行，仍然没有失败。这至少是以前的 20 倍。 :)

更新 2：再次查看我的测试，我发现了缓慢的部分，即字符串连接。（o..O）。我已经更新了测试并使用了 100 万行，仍然没有失败。 :)

You might try and use atomic groups ((?>expression)) to prevent backtracking:

Here's a test that failed with a block of 1000 lines using your regex but succeeds now (takes a while, thus I only tested with ~~5000~~ 20000 :) ):

String regex = "(?>(?>[a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+(?>[a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(?>\\r?\\n)?)+";

StringBuilder input = new StringBuilder();

for( int i = 0; i < 1000000; ++i) {
  input.append("abc.com, 123, 456\n");
}

Pattern p = Pattern.compile( regex );
Matcher m = p.matcher( input );

System.out.println(m.matches());

So after all, it might still be a backtracking problem.

Update: just let that test run with 20000 lines and still didn't fail. That's at least 20 times as much as before. :)

Update 2: looking at my test again I found the slow part, the string concatenation. (o..O). I've updated the test and used 1 Million lines, still no fail. :)

回复收藏 0 原文

另类 2024-12-15 10:56:59

问题是你的正则表达式太复杂了。您处理的每一行输入都会产生（我认为）10 个回溯点，并且至少其中一些似乎是由正则表达式引擎递归处理的。这可能是几百个堆栈帧，足以导致 StackOverflowError。

IMO，您需要修改模式以使其匹配一组/行数据。然后重复调用 Matcher.find 来解析每一行。我希望您会发现这样更快。

以其他方式优化正则表达式，同时仍尝试一次性匹配整个块可能行不通。您也许能够让它匹配 N 倍多的数据行，但是当您增加输入中的行数时，您可能会再次遇到相同的问题。

即使您确实让它作为多行正则表达式工作，它也有可能无法与 Java 正则表达式库的其他实现一起工作；例如，在较旧的 Oracle JRE 或非 Oracle 实现中。

我同意其他答案，这不是“灾难性回溯”的例子。相反，它是正则表达式引擎处理回溯点的方式与当您给它多行输入时回溯点太多这一事实之间的相互作用。

回复收藏 0 原文

你好，陌生人 2024-12-15 10:56:59

我已经重现了这个问题，但仅限于更大的字符串。

$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2)    (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

我的测试代码：

public class Testje
{
    public static void main(String... args)
    {
        String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
        String data = "";
        for (int i = 0; i<224; i++) data += "abc.com, 123, 456\n";
        System.out.println(data.matches(regex));
    }
}

对于 for 循环中任何小于 224 的值，代码运行良好。对于该行的 224 个及更多副本，我得到了巨大的堆栈跟踪。

哦，请注意，使用 (?: groups 不会改变仍然有效的字符串的大小。

I've reproduced this problem, but only for much larger strings.

$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2)    (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

My test code:

public class Testje
{
    public static void main(String... args)
    {
        String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
        String data = "";
        for (int i = 0; i<224; i++) data += "abc.com, 123, 456\n";
        System.out.println(data.matches(regex));
    }
}

For anything smaller than 224 in that for loop, the code runs fine. For 224 and more copies of that line, I get a huge stack trace.

Oh, note that using (?: groups does not change the size of the string that still works.

回复收藏 0 原文

~没有更多了~