在java中使用字符串匹配时出现stackoverflow异常
对于我正在做的一个小型大学项目,我需要从以字符串形式给出的 html 中提取代码示例。 更准确地说,我需要从该 html 字符串中获取
之间的所有内容。 和
我用 Java 编写,并使用 String.match 来做到这一点。
我的代码:
public static ArrayList<String> extractByHTMLtagDelimiters(String source, String startDelimiter, String endDelimiter){
ArrayList<String> results = new ArrayList<String>();
if (source.matches("([\t\n\r]|.)*" + startDelimiter + "([\t\n\r]|.)*" + endDelimiter)){
//source has some code samples in it
//get array entries of the form: {Some code}</startDelimiter>{something else}
String[] splittedSource = source.split(startDelimiter);
for (String sourceMatch : splittedSource){
if (sourceMatch.matches("([\t\n\r]|.)*" + endDelimiter + "([\t\n\r]|.)*")){
//current string has code sample in it (with some body leftovers)
//the code sample located before the endDelimiter - extract it
String codeSample = (sourceMatch.split(endDelimiter))[0];
//add the code samples to results
results.add(codeSample);
}
}
}
return results;
iv'e 尝试从大约 1300 个字符的 html 中提取该样本,并得到了相当大的异常:(它持续了几十行)
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
我发现了以下错误报告: https://bugs.java.com/bugdatabase/view_bug?bug_id=5050507
我能做些什么来仍然使用 string.match 吗?如果没有,你能推荐一些其他方法来做到这一点,而不需要我自己实现html解析吗?
非常感谢, 配音。
For a little university project i'm doing, i need to extract code samples from html given as a string.
To by more precise, i need to get from that html string, everything in between <code>
and </code>
.
I'm writing in Java, and using String.match to do that.
My code:
public static ArrayList<String> extractByHTMLtagDelimiters(String source, String startDelimiter, String endDelimiter){
ArrayList<String> results = new ArrayList<String>();
if (source.matches("([\t\n\r]|.)*" + startDelimiter + "([\t\n\r]|.)*" + endDelimiter)){
//source has some code samples in it
//get array entries of the form: {Some code}</startDelimiter>{something else}
String[] splittedSource = source.split(startDelimiter);
for (String sourceMatch : splittedSource){
if (sourceMatch.matches("([\t\n\r]|.)*" + endDelimiter + "([\t\n\r]|.)*")){
//current string has code sample in it (with some body leftovers)
//the code sample located before the endDelimiter - extract it
String codeSample = (sourceMatch.split(endDelimiter))[0];
//add the code samples to results
results.add(codeSample);
}
}
}
return results;
iv'e tried to extract that samples from some html of ~1300 chars and got pretty massive exception: (it goes on and on for few dozens of lines)
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
i've found the following bug report:
https://bugs.java.com/bugdatabase/view_bug?bug_id=5050507
is there anything i can do to still use string.match? if not, can you please recommend some other way to do it without implementing html parsing by myself?
Thank a lot,
Dub.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以使用 String 的 indexOf() 方法手动浏览输入字符串,找到开始和结束分隔符,并提取出它们之间的位。
如果您需要查找多个,则只需从完成的位置开始再次使用
indexOf
即可:You can just manually go through the input string using String's indexOf() method to find the start and end deliminters and extract out the bits between yourself.
If you need to find more than one, then just use
indexOf
again starting at the point you finished:只需用
"(?s).*"
替换您的正则表达式模式即可匹配任何内容,包括您想要的新行。
Simply replace your regex pattern with
"(?s).*"
This matches anything including new lines as you intended.