java.lang.StackOverflowError when parsing a large string with a regular expression

Posted on 2024-12-05 14:52:25

This is my regex:

((?:(?:'[^']*')|[^;])*)[;]

It tokenizes a string on semicolons. For example,

Hello world; I am having a problem; using regex;

The result is three strings:

Hello world
I am having a problem
using regex

But when I use a large input string, I get this error:

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)

What causes this, and how can I solve it?

噩梦成真你也成魔 2024-12-12 14:52:25

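Unfortunately, Java's built-in regex support has problems with regular expressions that contain repeated alternation, i.e. (A|B)*. Such a pattern is compiled into recursive calls, which cause a StackOverflowError when the regex is applied to very long strings.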

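One possible solution is to rewrite your regex so that it does not use a repeated alternation. But if your goal is just to tokenize a string on semicolons, you don't need a complex regex at all: call String.split() with a simple ";" as the argument.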
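A minimal sketch of the String.split() suggestion follows. The class name SplitOnSemicolons and the helper tokenize are made up for illustration, and the trimming step is an assumption about the desired output. Unlike the regex in the question, this version does not protect semicolons inside single-quoted text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original question.
final class SplitOnSemicolons {

    // Splits on every ';' and trims each piece.
    // Quoted semicolons are NOT protected by this approach.
    static List<String> tokenize(String input) {
        return Arrays.stream(input.split(";"))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Hello world, I am having a problem, using regex]
        System.out.println(tokenize("Hello world; I am having a problem; using regex;"));
    }
}
```

String.split() drops trailing empty strings, so the final semicolon does not produce an extra empty token.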

浪漫之都 2024-12-12 14:52:25

If you really need to use a regex that overflows the stack, you can increase the stack size by passing something like -Xss40m to the JVM.
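A related option, swapped in here for -Xss, is to give only the thread that runs the regex a larger stack, using the Thread constructor that takes a stackSize argument. The sketch below assumes this is acceptable in your application. The class name BigStackRegex and the method tokenize are hypothetical, and the Thread documentation notes that the requested stack size is only a hint that some platforms may ignore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class BigStackRegex {

    // The regex from the question, unchanged.
    private static final Pattern TOKEN =
            Pattern.compile("((?:(?:'[^']*')|[^;])*)[;]");

    // Runs the matcher on a worker thread that requests stackBytes of stack.
    static List<String> tokenize(String input, long stackBytes) throws InterruptedException {
        List<String> tokens = new ArrayList<>();
        Throwable[] failure = new Throwable[1];

        Thread worker = new Thread(null, () -> {
            try {
                Matcher m = TOKEN.matcher(input);
                while (m.find()) {
                    tokens.add(m.group(1).trim());
                }
            } catch (Throwable t) {
                // Includes StackOverflowError if the requested stack was still too small.
                failure[0] = t;
            }
        }, "regex-worker", stackBytes);

        worker.start();
        worker.join(); // join() makes the worker's writes visible to this thread

        if (failure[0] != null) {
            throw new RuntimeException("regex matching failed", failure[0]);
        }
        return tokens;
    }

    public static void main(String[] args) throws InterruptedException {
        // Request roughly 40 MB, mirroring the -Xss40m example.
        System.out.println(tokenize("Hello world; I am having a problem; using regex;",
                40L * 1024 * 1024));
    }
}
```

Either way the recursion depth still grows with the input, so a long enough string can overflow any fixed stack size.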

莫多说 2024-12-12 14:52:25

It might help to add a + after the [^;], so that there are fewer repetitions.

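Isn't there also a construct that says "if the regex has matched up to this point, don't backtrack"? That might come in handy too. (Update: it is called a possessive quantifier.)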
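The two ideas above can be combined in one pattern; the following is a sketch rather than a verified fix. The class name PossessiveTokenizer is made up, and the pattern assumes quotes are balanced, because the unquoted branch [^;']++ no longer matches a stray quote the way [^;] did in the question. In the JDK implementations I am aware of, a possessively quantified group is evaluated in a loop rather than by recursion, but that is an implementation detail and not a documented guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class PossessiveTokenizer {

    // '[^']*+'  : a quoted run, matched possessively
    // [^;']++   : a run of unquoted characters, matched possessively
    // (?:...)*+ : the repeated alternation never gives characters back
    private static final Pattern TOKEN =
            Pattern.compile("((?:'[^']*+'|[^;']++)*+);");

    static List<String> tokenize(CharSequence input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1).trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello world, I am having a problem, using regex]
        System.out.println(tokenize("Hello world; I am having a problem; using regex;"));
        // prints [say 'a;b', done]
        System.out.println(tokenize("say 'a;b'; done;"));
    }
}
```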

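A completely different alternative is to write a utility method called splitQuoted(char quote, char separator, CharSequence s) that explicitly iterates over the string and keeps track of whether it has seen an odd number of quotes. In that method you could also handle the case where the quote character needs to be unescaped when it appears inside a quoted string, for example: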

'I'm what I am', said the fox; and he disappeared.
'I\'m what I am', said the fox; and he disappeared.
'I''m what I am', said the fox; and he disappeared.
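One way such a method might look is sketched below. The signature comes from the answer, while the class name QuotedSplitter, the trimming, and the handling of a trailing token without a separator are assumptions. It only tracks quote parity: a doubled quote such as '' keeps the scanner inside the quoted run, but backslash escapes are not interpreted and nothing is unescaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; only the method signature comes from the answer.
final class QuotedSplitter {

    static List<String> splitQuoted(char quote, char separator, CharSequence s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // true while an odd number of quotes has been seen

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;
                current.append(c); // quotes are kept in the token as-is
            } else if (c == separator && !inQuotes) {
                tokens.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }

        // Assumption: keep a non-empty trailing token that has no closing separator.
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            tokens.add(rest);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints ['I''m what I am', said the fox, and he disappeared.]
        System.out.println(splitQuoted('\'', ';',
                "'I''m what I am', said the fox; and he disappeared."));
    }
}
```

Because this is a plain loop, its stack usage does not depend on the length of the input.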
