My regex is causing a stack overflow in Java; what am I missing?

Posted on 2024-09-18 16:33:17

I am attempting to use a regular expression with Scanner to match a string from a file. The regex works with all of the contents of the file except for this line:

DNA="ITTTAITATIATYAAAYIYI[....]ITYTYITTIYAIAIYIT"

In the actual file, the ellipsis represents several thousand more characters.

When the loop that reads the file arrives on the line containing the bases, a stack overflow error occurs.

Here is the loop:

while (scanFile.hasNextLine()) {
    final String currentLine = scanFile.findInLine(".*");
    System.out.println("trying to match '" + currentLine + "'");
    Scanner internalScanner = new Scanner(currentLine);
    String matchResult = internalScanner.findInLine(Constants.ANIMAL_INFO_REGEX);
    assert matchResult != null : "there's no reason not to find a match";
    matches.put(internalScanner.match().group(1), internalScanner.match().group(2));
    scanFile.nextLine();
}

and the regex:

static final String ANIMAL_INFO_REGEX = "([a-zA-Z]+) *= *\"(([a-zA-Z_.]| |\\.)+)";

Here's the failure trace:

java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3360)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    ...etc (it's all regex).

Thanks so much!

薄荷港 2024-09-25 16:33:18

Try this simplified version of your regex that removes some unnecessary | operators (which might have been causing the regex engine to do a lot of branching) and includes beginning and end of line anchors.

static final String ANIMAL_INFO_REGEX = "^([a-zA-Z]+) *= *\"([a-zA-Z_. ]+)\"$";
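
As a rough check of this suggestion, the sketch below builds a line with several thousand bases and matches it with the simplified pattern. The class name SimplifiedRegexDemo, the helper parseLine, and the 5,000-character length are illustrative assumptions, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimplifiedRegexDemo {
    // The simplified pattern from the answer above: one character class, no alternation.
    static final Pattern ANIMAL_INFO =
        Pattern.compile("^([a-zA-Z]+) *= *\"([a-zA-Z_. ]+)\"$");

    // Returns {key, value} for a line such as DNA="ITTA", or null if the line does not match.
    static String[] parseLine(String line) {
        Matcher m = ANIMAL_INFO.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the long DNA line in the question.
        StringBuilder bases = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            bases.append("ITYA".charAt(i % 4));
        }
        String[] parsed = parseLine("DNA=\"" + bases + "\"");
        System.out.println(parsed[0] + " -> " + parsed[1].length() + " characters");
    }
}
```

Note that this pattern requires the closing quote and the end of the line, so a line with trailing text after the quote would be rejected rather than partially matched.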

再浓的妆也掩不了殇 2024-09-25 16:33:18

Read this to understand the problem: http://www.regular-expressions.info/catastropic.html ...then use one of the other suggestions.

苍暮颜 2024-09-25 16:33:18

As the others have said, your regex is much less efficient than it should be. I'd take it a step further and use possessive quantifiers:

"^([a-zA-Z]++) *+= *+\"([^\"]++)\"$"
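
To make the difference concrete, here is a small sketch comparing a greedy and a possessive quantifier, followed by the pattern above applied to a single line. The class and method names (PossessiveDemo, greedyMatches, possessiveMatches, quotedValue) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a+ can give back one 'a' so that the trailing 'a' still matches.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a+a", s);
    }

    // Possessive: a++ keeps every 'a' it consumed, so the trailing 'a' can never match.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a++a", s);
    }

    // The pattern from the answer, applied to one line; returns the quoted value or null.
    static String quotedValue(String line) {
        Matcher m = Pattern.compile("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$").matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));      // true
        System.out.println(possessiveMatches("aaa"));  // false
        System.out.println(quotedValue("DNA = \"ITTTAITAT\""));
    }
}
```

Since a possessive quantifier never gives characters back, the matcher has no reason to remember positions to retry inside the long quoted value, which is the part of the line that caused trouble in the question.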

But the way you're using the Scanner doesn't make much sense, either. There's no need to use findInLine(".*") to read the line; that's what nextLine() does. And you don't need to create another Scanner to apply your regex; just use a Matcher.

static final Pattern ANIMAL_INFO_PATTERN = 
    Pattern.compile("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$");

...

  Matcher lineMatcher = ANIMAL_INFO_PATTERN.matcher("");
  while (scanFile.hasNextLine()) {
    String currentLine = scanFile.nextLine();
    if (lineMatcher.reset(currentLine).matches()) {
      matches.put(lineMatcher.group(1), lineMatcher.group(2));
    }
  }
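
For anyone who wants to run the loop as-is, here is a self-contained sketch of the same approach. It assumes that matches is a Map<String, String> and feeds the Scanner from an in-memory string instead of a file; the class name AnimalInfoReader, the method readAll, and the sample input are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnimalInfoReader {
    static final Pattern ANIMAL_INFO_PATTERN =
        Pattern.compile("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$");

    // Reads key="value" lines from the scanner; lines that do not match are skipped.
    static Map<String, String> readAll(Scanner scanFile) {
        Map<String, String> matches = new LinkedHashMap<>();
        // One Matcher is created up front and reset for each line.
        Matcher lineMatcher = ANIMAL_INFO_PATTERN.matcher("");
        while (scanFile.hasNextLine()) {
            String currentLine = scanFile.nextLine();
            if (lineMatcher.reset(currentLine).matches()) {
                matches.put(lineMatcher.group(1), lineMatcher.group(2));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the file from the question.
        String input = "name = \"Rex\"\nnot a key-value line\nDNA=\"ITTTAITATIAT\"\n";
        try (Scanner scanFile = new Scanner(input)) {
            System.out.println(readAll(scanFile));
        }
    }
}
```

Unlike the assert in the question, this version silently skips lines that do not match; whether that is acceptable depends on how strict the file format is meant to be.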

生死何惧 2024-09-25 16:33:17

This looks like bug 5050507. I agree with Asaph that removing the alternation should help; the bug specifically says "Avoid alternation whenever possible". I think you can probably go even simpler: