StackOverflowError when using string matching in Java

Posted on 2024-10-30 08:40:10

For a small university project I'm working on, I need to extract code samples from HTML given as a string.
To be more precise, I need to get everything between <code> and </code> from that HTML string.

I'm writing in Java and using String.matches to do that.

My code:

public static ArrayList<String> extractByHTMLtagDelimiters(String source, String startDelimiter, String endDelimiter){
    ArrayList<String> results = new ArrayList<String>();
    if (source.matches("([\t\n\r]|.)*" + startDelimiter + "([\t\n\r]|.)*" + endDelimiter)){
        //source has some code samples in it
        //get array entries of the form: {Some code}</startDelimiter>{something else}
        String[] splittedSource = source.split(startDelimiter);
        for (String sourceMatch : splittedSource){
            if (sourceMatch.matches("([\t\n\r]|.)*" + endDelimiter + "([\t\n\r]|.)*")){
                //current string has code sample in it (with some body leftovers)
                //the code sample located before the endDelimiter - extract it
                String codeSample = (sourceMatch.split(endDelimiter))[0];
                //add the code samples to results
                results.add(codeSample);
            }
        }
    }
    return results;
}

I tried to extract the samples from some HTML of ~1300 characters and got a pretty massive exception (it goes on and on for a few dozen lines):

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)

I've found the following bug report:
https://bugs.java.com/bugdatabase/view_bug?bug_id=5050507

Is there anything I can do to keep using String.matches? If not, can you recommend another way to do this without implementing HTML parsing myself?

Thanks a lot,
Dub.

Comments (2)

小红帽 2024-11-06 08:40:10

You can go through the input string manually, using String's indexOf() method to find the start and end delimiters and extracting the text between them yourself.

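public static void main(String[] args) {
    String source = "<html>blah<code>this is awesome</code>more junk</html>";

    String startDelim = "<code>";
    String endDelim = "</code>";
    int start = source.indexOf(startDelim);
    // look for the end delimiter only after the start delimiter
    int end = source.indexOf(endDelim, start + startDelim.length());

    // indexOf returns -1 when a delimiter is not found
    if (start != -1 && end != -1) {
        String code = source.substring(start + startDelim.length(), end);
        System.out.println(code);
    }
}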

If you need to find more than one, just call indexOf again, starting from the point where you finished:

int nextStart = source.indexOf(startDelim, end + endDelim.length());
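
Putting the two steps together, a minimal sketch of a loop that collects every sample might look like the following. The class name CodeSampleExtractor and the method name extractAll are made up for illustration, and the sketch assumes the delimiters are plain literal strings and that the <code> blocks are not nested.

```java
import java.util.ArrayList;
import java.util.List;

public class CodeSampleExtractor {

    // Hypothetical helper: returns the text between each startDelim/endDelim pair.
    // Assumes literal delimiters and no nesting.
    public static List<String> extractAll(String source, String startDelim, String endDelim) {
        List<String> results = new ArrayList<>();
        int start = source.indexOf(startDelim);
        while (start != -1) {
            int contentStart = start + startDelim.length();
            int end = source.indexOf(endDelim, contentStart);
            if (end == -1) {
                break; // start delimiter without a matching end delimiter: stop
            }
            results.add(source.substring(contentStart, end));
            // continue searching after the end delimiter that was just consumed
            start = source.indexOf(startDelim, end + endDelim.length());
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>a<code>first</code>b<code>second\nline</code>c</html>";
        for (String sample : extractAll(source, "<code>", "</code>")) {
            System.out.println(sample);
        }
    }
}
```

Because indexOf is a plain scan with no recursion, the length of the input should not affect stack depth here.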

心是晴朗的。 2024-11-06 08:40:10

Simply replace each ([\t\n\r]|.)* in your regex pattern with "(?s).*"

The (?s) flag makes . match any character, including newlines, which is what you intended.
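
As a rough illustration of this suggestion, the sketch below uses Pattern and Matcher.find() with the (?s) flag and a capturing group instead of the matches/split combination from the question. The class name RegexCodeSampleExtractor and the method name extractAll are hypothetical, and wrapping the delimiters in Pattern.quote() is an added assumption that they should be treated as literal text. The likely reason this avoids the overflow is that a repeated group containing an alternation, such as ([\t\n\r]|.)*, is matched recursively by java.util.regex (the Loop/GroupHead/Branch frames in the stack trace), whereas a plain .*? is not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCodeSampleExtractor {

    // Hypothetical helper: collects the text between each delimiter pair.
    public static List<String> extractAll(String source, String startDelim, String endDelim) {
        // (?s) lets '.' match line terminators; '.*?' is reluctant, so each match
        // stops at the first end delimiter instead of the last one.
        Pattern pattern = Pattern.compile(
                "(?s)" + Pattern.quote(startDelim) + "(.*?)" + Pattern.quote(endDelim));
        Matcher matcher = pattern.matcher(source);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1)); // group 1 is the text between the delimiters
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<p>x</p><code>a\nb</code>y<code>c</code>";
        for (String sample : extractAll(source, "<code>", "</code>")) {
            System.out.println(sample);
        }
    }
}
```

For anything beyond simple, non-nested tags, a real HTML parser is usually the safer choice than a regex.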
