My regex is causing a stack overflow in Java; what am I missing?
I am attempting to use a regular expression with Scanner to match a string from a file. The regex works with all of the contents of the file except for this line:
DNA="ITTTAITATIATYAAAYIYI[....]ITYTYITTIYAIAIYIT"
In the actual file, the ellipsis represents several thousand more characters.
When the loop that reads the file arrives on the line containing the bases, a stack overflow error occurs.
Here is the loop:
while (scanFile.hasNextLine()) {
    final String currentLine = scanFile.findInLine(".*");
    System.out.println("trying to match '" + currentLine + "'");
    Scanner internalScanner = new Scanner(currentLine);
    String matchResult = internalScanner.findInLine(Constants.ANIMAL_INFO_REGEX);
    assert matchResult != null : "there's no reason not to find a match";
    matches.put(internalScanner.match().group(1), internalScanner.match().group(2));
    scanFile.nextLine();
}
And the regex:
static final String ANIMAL_INFO_REGEX = "([a-zA-Z]+) *= *\"(([a-zA-Z_.]| |\\.)+)";
Here's the failure trace:
java.lang.StackOverflowError
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3360)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
...etc (it's all regex).
Thanks so much!
Comments (4)
Try this simplified version of your regex that removes some unnecessary | operators (which might have been causing the regex engine to do a lot of branching) and includes beginning and end of line anchors.
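The regex from this answer did not survive on this page, so the following is only a sketch of what such a simplification could look like. It assumes that every value is closed by a double quote at the end of the line, and it folds the alternation ([a-zA-Z_.]| |\.) into the single character class [a-zA-Z_. ]. The class name SimplifiedAnimalRegex and the parse helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimplifiedAnimalRegex {
    // Assumed rewrite: no '|' inside the repeated group, plus ^ and $ anchors.
    // Group 1 is the key, group 2 is the quoted value (same numbering the question uses).
    static final Pattern ANIMAL_INFO =
            Pattern.compile("^([a-zA-Z]+) *= *\"([a-zA-Z_. ]+)\"$");

    /** Returns {key, value}, or null when the line does not have the expected shape. */
    static String[] parse(String line) {
        Matcher m = ANIMAL_INFO.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Build a long synthetic line, similar in spirit to the DNA line in the question.
        StringBuilder bases = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            bases.append("ITYA".charAt(i % 4));
        }
        String[] kv = parse("DNA=\"" + bases + "\"");
        System.out.println(kv[0] + " has " + kv[1].length() + " characters");
    }
}
```

A quantifier applied directly to a character class is handled by the engine in a loop, whereas a quantified group containing alternatives is matched recursively, one level per character, which is what the repeating frames in the trace suggest.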
Read this to understand the problem: http://www.regular-expressions.info/catastrophic.html ... and then use one of the other suggestions.
As the others have said, your regex is much less efficient than it should be. I'd take it a step further and use possessive quantifiers:
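The regex that followed here is missing from this page. As a rough sketch of the idea, and not necessarily the exact pattern the answer gave, a possessive version could look like the one below. The ++ and *+ quantifiers never give characters back, so the engine keeps no backtracking state for them. The class name PossessiveAnimalRegex and the parse helper are hypothetical, and the closing quote in the pattern is an assumption about the file format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveAnimalRegex {
    // Assumed pattern: the alternation is folded into one character class,
    // and every quantifier is possessive (++ or *+).
    static final Pattern ANIMAL_INFO =
            Pattern.compile("([a-zA-Z]++) *+= *+\"([a-zA-Z_. ]++)\"");

    /** Returns {key, value} for the first key="value" pair in the line, or null. */
    static String[] parse(String line) {
        Matcher m = ANIMAL_INFO.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] kv = parse("name = \"Felis catus\"");
        System.out.println(kv[0] + " = " + kv[1]);
    }
}
```

Possessive quantifiers are safe here because each one is followed by a character it cannot itself consume (a space, '=' or '"'). Where that is not the case they change what matches: a++a can never match aaa, because a++ takes all three letters and refuses to release one.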
But the way you're using the Scanner doesn't make much sense, either. There's no need to use findInLine(".*") to read the line; that's what nextLine() does. And you don't need to create another Scanner to apply your regex; just use a Matcher.
...
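The code that accompanied this advice is elided above, so here is a minimal sketch of the loop it describes: read each line with nextLine() and apply one reusable Matcher to it. The class AnimalFileReader, the readAll method, and the pattern itself are assumptions for illustration; scanFile and matches mirror the names in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnimalFileReader {
    // Assumed pattern (character class instead of alternation, possessive quantifiers).
    static final Pattern ANIMAL_INFO =
            Pattern.compile("([a-zA-Z]++) *+= *+\"([a-zA-Z_. ]++)");

    /** Collects key/value pairs from every line that contains one. */
    static Map<String, String> readAll(Scanner scanFile) {
        Map<String, String> matches = new LinkedHashMap<>();
        Matcher m = ANIMAL_INFO.matcher("");          // one Matcher, reset for each line
        while (scanFile.hasNextLine()) {
            String currentLine = scanFile.nextLine(); // replaces findInLine(".*") + nextLine()
            if (m.reset(currentLine).find()) {
                matches.put(m.group(1), m.group(2));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for the real file here.
        Scanner scanFile = new Scanner("name = \"Felis catus\"\nDNA=\"ITTTAITATIATYAAAYIYI\"\n");
        System.out.println(readAll(scanFile));
    }
}
```

Unlike the assert in the question, this version simply skips lines that do not match; whether that is the right behavior depends on the file.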
This looks like bug 5050507. I agree with Asaph that removing the alternation should help; the bug specifically says "Avoid alternation whenever possible". I think you can probably go even simpler:
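The pattern this answer proposed is not preserved on this page. One guess at "even simpler", offered only as a sketch, is to stop listing the allowed value characters and accept anything up to the next double quote. That is looser than the original regex (digits and punctuation would also be accepted), so it is an assumption that the file contains nothing that makes the difference matter. QuotedValueRegex and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedValueRegex {
    // Assumed pattern: a word-character key, '=', then everything up to the next quote.
    // No alternation anywhere, so the repeated part is a single negated character class.
    static final Pattern KEY_VALUE = Pattern.compile("(\\w+) *= *\"([^\"]+)");

    /** Returns {key, value} for the first key="value" pair in the line, or null. */
    static String[] parse(String line) {
        Matcher m = KEY_VALUE.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] kv = parse("DNA=\"ITTTAITATIATYAAAYIYI\"");
        System.out.println(kv[0] + " = " + kv[1]);
    }
}
```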