Java regex performance
I'm trying to parse links out of HTML with a regex in Java,
but it is far too slow. For example, to extract all links from:
...it takes 34642 milliseconds (over 34 seconds!)
Here is the regex:
private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";
The flags for the pattern:
private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;
And the code may be something like this:
private void processURL(URL url) {
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();
        while (line != null) {
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (Exception e) {
    }
}
Can you give me a hint?
Extra data:
1 Mbit
Core 2 Duo
1 GB RAM
Single-threaded
Comments (4)
Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!
Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6), where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases, e.g. when the input data is sufficiently malformed that the regex does not match.
A far, far better solution is to use an existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.
This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.
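To make the parser suggestion concrete, here is a minimal sketch of link extraction with no regex at all. It does not use TagSoup or HtmlCleaner. As a stand-in it uses the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), which is old and aimed at older HTML, so treat it as an illustration of the approach and not a recommendation of that particular parser. The class and method names (LinkExtractor, extractLinks) are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original question or answer.
public class LinkExtractor {

    /** Collects the href value of every <a> start tag read from the reader. */
    public static List<String> extractLinks(Reader reader) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    // Anchors without an href (e.g. <a name="...">) are skipped.
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in
        // the document, since the Reader has already decoded the bytes.
        new ParserDelegator().parse(reader, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<A HREF='b.html' class=x>B</A>"
                + "<a name=\"anchor\">no link</a>"
                + "</body></html>";
        for (String link : extractLinks(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```

The parser makes a single pass over the input and hands each tag to the callback, so there is no backtracking to blow up on malformed pages. In processURL, the same method could be given the BufferedReader directly, which also avoids building the whole page as one string.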
All your time, all of it, is being spent here:
html += line;
Use a StringBuffer. Better still, if you can, run the match on every line and don't accumulate them at all.
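For illustration, the read loop from the question could be rewritten along these lines. Each += on a String copies everything accumulated so far into a new String, so the original loop does work proportional to the square of the page length, while appending to a buffer does not. This sketch uses StringBuilder, the unsynchronized sibling of the StringBuffer named above, which is enough for single-threaded code. The class and method names (PageReader, readAll) are made up for this example, and appending '\n' after each line is an assumption: the question's code joins lines with no separator, since readLine() strips the line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of the original question or answer.
public class PageReader {

    /** Reads everything from the source into one String, line by line. */
    public static String readAll(Reader source) throws IOException {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // append() grows one internal buffer instead of allocating
                // a new String on every iteration.
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = readAll(new StringReader("<a href=\"a.html\">a</a>\n<p>text</p>"));
        System.out.print(page);
    }
}
```

In processURL this would replace the while loop, e.g. String html = PageReader.readAll(new InputStreamReader(in)). Matching line by line, as suggested above, avoids the buffer entirely, but it would miss any tag that is split across two lines.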
I have written a simple test comparing the performance of 10 million RegExp operations against String.indexOf(), with the following result:
Try Jaunt instead. Please don't use regex for this.
Regex use vs. Regex abuse
Source