java.lang.StackOverflowError when parsing a large string with a regular expression
This is my regex:
((?:(?:'[^']*')|[^;])*)[;]
It tokenizes a string on semicolons. For example,
Hello world; I am having a problem; using regex;
The result is three strings:
Hello world
I am having a problem
using regex
But when I use a large input string, I get this error:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
What causes this, and how can I solve it?
Unfortunately, Java's built-in regex support has problems with regexes containing repeated alternatives, that is, patterns of the form (A|B)*. Such a pattern is compiled into recursive calls, which results in a StackOverflowError when it is matched against a very large string.
A possible solution is to rewrite your regex so that it does not use a repeated alternative. But if your goal is just to tokenize a string on semicolons, you don't really need a complex regex at all: just use String.split() with a simple ";" as the argument.
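As a minimal sketch of the String.split() suggestion (the class name SimpleSplit and the helper tokenize are made up for illustration), something like the following should work. Note that, unlike the regex in the question, a plain split also cuts at semicolons inside single quotes, so it only fits input without quoted semicolons.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
final class SimpleSplit {

    // Splits on every ';', trims each piece, and skips empty pieces.
    // Assumption: semicolons inside quotes do not need to be preserved.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world; I am having a problem; using regex;"));
    }
}
```

String.split() with a single-character literal such as ";" does not go through the recursive group matching shown in the stack trace, so the input length should not affect stack depth here.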
If you really need to use a regex that overflows your stack, you can increase the size of your stack by passing something like -Xss40m to the JVM.
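On the command line that would look like java -Xss40m YourMainClass (the class name is a placeholder). If changing JVM flags is not an option, a related technique is to run the match on a dedicated thread with a larger requested stack. The sketch below is an assumption-laden illustration, not part of the original answer: BigStackTokenizer and tokenize are invented names, and the Thread javadoc states that the stackSize argument is only a hint that some platforms may ignore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; runs the question's regex on a thread with a bigger stack.
final class BigStackTokenizer {

    // The regex from the question, unchanged.
    private static final Pattern TOKEN = Pattern.compile("((?:(?:'[^']*')|[^;])*)[;]");

    static List<String> tokenize(final String input, long stackBytes) {
        final List<String> tokens = new ArrayList<>();
        final Throwable[] failure = new Throwable[1];

        // The fourth constructor argument requests a stack size for this thread only.
        Thread worker = new Thread(null, () -> {
            try {
                Matcher m = TOKEN.matcher(input);
                while (m.find()) {
                    tokens.add(m.group(1).trim());
                }
            } catch (Throwable t) {
                // Includes StackOverflowError if the requested stack is still too small.
                failure[0] = t;
            }
        }, "regex-worker", stackBytes);

        worker.start();
        try {
            worker.join(); // join() makes the worker's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the match", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("matching failed", failure[0]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        long stack = 64L * 1024 * 1024; // 64 MB, an arbitrary illustrative value
        System.out.println(tokenize("Hello world; I am having a problem; using regex;", stack));
    }
}
```

Either way, a larger stack only raises the input size at which the overflow occurs; the recursion depth still grows with the length of each token.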
It might help to add a + after the [^;], so that you have fewer repetitions. Isn't there also some construct that says "if the regular expression matched up to this point, don't backtrack"? Maybe that comes in handy, too. (Update: these are called possessive quantifiers.)
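One way this idea could be worked out is sketched below, under some assumptions. The unquoted alternative is changed from [^;] to [^';]++ so that a run of ordinary characters stops at a quote instead of swallowing it, and all three quantifiers are made possessive (*+ and ++). The class name PossessiveTokenizer is invented for illustration. With balanced quotes this should yield the same tokens as the original regex; with an unbalanced quote it behaves differently, because the possessive version cannot fall back to treating the quote as a plain character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class showing a possessive rewrite of the question's regex.
final class PossessiveTokenizer {

    // Quoted run:   '[^']*+'   (a quote, anything but a quote, a closing quote)
    // Unquoted run: [^';]++    (one or more characters that are neither quote nor semicolon)
    // The outer *+ repeats those runs without keeping backtracking state.
    private static final Pattern TOKEN = Pattern.compile("((?:'[^']*+'|[^';]++)*+);");

    static List<String> tokenize(CharSequence input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1).trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world; I am having a problem; using regex;"));
        System.out.println(tokenize("a 'x;y' b; c;"));
    }
}
```

Because each run of characters is consumed by a single possessive quantifier instead of one loop iteration per character, the matcher should no longer need one level of recursion per input character.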
A completely different alternative is to write a utility method called
splitQuoted(char quote, char separator, CharSequence s)
that explicitly iterates over the string and remembers whether it has seen an odd number of quotes. In that method you could also handle the case that the quote character might need to be unescaped when it appears in a quoted string.
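A minimal sketch of such a method might look like the following. Only the signature comes from the answer; the class name QuotedSplitter, the trimming of tokens, and the choice to keep a non-empty remainder after the last separator are assumptions made for illustration. Unescaping of quote characters is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical utility class; only the method signature is taken from the answer.
final class QuotedSplitter {

    // Splits s at each separator that is not inside a quoted section.
    // "Inside" means an odd number of quote characters has been seen so far.
    static List<String> splitQuoted(char quote, char separator, CharSequence s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;   // toggle on every quote character
                current.append(c);      // quotes are kept in the token
            } else if (c == separator && !inQuotes) {
                tokens.add(current.toString().trim());
                current.setLength(0);   // start the next token
            } else {
                current.append(c);
            }
        }

        // Assumption: text after the last separator counts as a token if non-empty.
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            tokens.add(rest);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitQuoted('\'', ';', "Hello world; I am having a problem; using regex;"));
        System.out.println(splitQuoted('\'', ';', "a 'x;y' b; c"));
    }
}
```

Since this is a plain loop with no recursion, the stack depth stays constant regardless of the input length.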