Pattern.matches() 给出 StackOverflowError
我正在使用 java 的 Pattern.matches 将数据块与正则表达式匹配。数据块可以是单行或多行。问题是,一旦我的数据超过 15 行(通常超过 17-18 行),我就会开始出现 stackoverflowerror。对于少于 15 行的数据,正则表达式可以正常工作。
正则表达式的格式如下:
域名->空间-> , ->空间->数量->空间-> , ->空间->数量->换行符
String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
我用来测试这个正则表达式的数据块是这个
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
这是代码:
String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here
I'm using java's Pattern.matches to match a block of data to a regex. The block of data can be a single line or multiple lines. The problem is that once my data becomes more than 15 lines (typically more than 17-18 lines), i start getting stackoverflowerror. For data less than 15 lines the regex works fine.
The Regex is of this format:
domainname -> space -> , -> space -> number -> space -> , -> space -> number -> newline
String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
The data block i use to test against this regex is this
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
This is the code:
String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我无法告诉你这个错误的原因;正则表达式本身很好,不会出现灾难性的回溯或任何其他明显的错误。
也许您可以通过使用 所有格量词 (< code>++ 代替
+
,*+
代替*
,{2,}+
> 代替{2,}
等)。另外,您不需要捕获组(感谢托马斯),所以我将它们更改为非捕获组:这不会改变正则表达式的行为(除了删除不必要的锚点,因为您使用
Pattern.matches()
),但也许它有助于避免 StackOverflows。我没有安装Java SDK,所以我无法自己测试。I can't tell you the reason for this error; the regex itself is fine and not subject to catastrophic backtracking or any other obvious error.
Perhaps you can reduce the number of backtracking positions the regex engine saves by using possessive quantifiers (
++
instead of+
,*+
instead of*
,{2,}+
instead of{2,}
etc.). Also, you don't need the capturing groups (thanks Thomas), so I've changed them into non-capturing ones:This won't change the behaviour of the regex (except for the removal of the unnecessary anchors since you're using
Pattern.matches()
), but perhaps it helps avoid StackOverflows. I don't have a Java SDK installed, so I can't test it myself, though.您可以尝试使用原子组 (
(?>expression)
) 来防止回溯:这是一个使用正则表达式进行 1000 行块失败的测试,但现在成功了(需要一段时间,因此我仅使用
500020000 进行了测试:) ):所以毕竟,它可能仍然是一个回溯问题。
更新:只是让该测试运行 20000 行,仍然没有失败。这至少是以前的 20 倍。 :)
更新 2:再次查看我的测试,我发现了缓慢的部分,即字符串连接。 (o..O)。我已经更新了测试并使用了 100 万行,仍然没有失败。 :)
You might try and use atomic groups (
(?>expression)
) to prevent backtracking:Here's a test that failed with a block of 1000 lines using your regex but succeeds now (takes a while, thus I only tested with
500020000 :) ):So after all, it might still be a backtracking problem.
Update: just let that test run with 20000 lines and still didn't fail. That's at least 20 times as much as before. :)
Update 2: looking at my test again I found the slow part, the string concatenation. (o..O). I've updated the test and used 1 Million lines, still no fail. :)
问题是你的正则表达式太复杂了。您处理的每一行输入都会产生(我认为)10 个回溯点,并且至少其中一些似乎是由正则表达式引擎递归处理的。这可能是几百个堆栈帧,足以导致
StackOverflowError
。IMO,您需要修改模式以使其匹配一组/行数据。然后重复调用
Matcher.find
来解析每一行。我希望您会发现这样更快。以其他方式优化正则表达式,同时仍尝试一次性匹配整个块可能行不通。您也许能够让它匹配 N 倍多的数据行,但是当您增加输入中的行数时,您可能会再次遇到相同的问题。
即使您确实让它作为多行正则表达式工作,它也有可能无法与 Java 正则表达式库的其他实现一起工作;例如,在较旧的 Oracle JRE 或非 Oracle 实现中。
我同意其他答案,这不是“灾难性回溯”的例子。相反,它是正则表达式引擎处理回溯点的方式与当您给它多行输入时回溯点太多这一事实之间的相互作用。
The problem is that your regex is too complicated. Each line of input that you process results in (I think) 10 backtrack points, and at least some of these seem to be handled by the regex engine recursing. That could be a few hundred stack frames which would be enough to give you
StackOverflowError
.IMO, you need to modify the pattern so that it will match one group / line of data. Then call
Matcher.find
repeatedly to parse each line. I expect that you will find that this is faster.Optimizing the regex in other ways while still trying to match the entire block in one go probably won't work. You may be able to get it to match N times more lines of data, but as you increase the number of lines in the input you are likely to run into the same problem again.
And even if you do get it to work as a multi-line regex, there is a chance that it won't work with other implementations of the Java regex libraries; e.g. in older Oracle JREs or non-Oracle implementations.
I agree with the other answers that this is not an example of "catastrophic backtracking". Rather it is an interaction between the way that the regex engine handles backtrack points, and the fact that there are simply too many of them when you give it multiple lines of input.
我已经重现了这个问题,但仅限于更大的字符串。
我的测试代码:
对于 for 循环中任何小于 224 的值,代码运行良好。对于该行的 224 个及更多副本,我得到了巨大的堆栈跟踪。
哦,请注意,使用 (?: groups 不会改变仍然有效的字符串的大小。
I've reproduced this problem, but only for much larger strings.
My test code:
For anything smaller than 224 in that for loop, the code runs fine. For 224 and more copies of that line, I get a huge stack trace.
Oh, note that using (?: groups does not change the size of the string that still works.