Java正则表达式性能

发布于 2024-09-27 04:46:45 字数 1371 浏览 5 评论 0原文

我正在尝试使用 Java 解析正则表达式的链接。

但我认为它变得太慢了。例如,要从以下位置提取所有链接:

...它花费了 34642 毫秒(34 秒!!!)

这是正则表达式:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

模式的标志:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

代码可能是这样的:

private void processURL(URL url){
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();            
        while(line!=null){
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
     } catch (Exception e){
     }
 }

能给个提示吗?

额外数据:
1Mbit
酷睿 2 双核
1Gb 内存
单线程

I'm trying to parse links with regex with Java.

But I think it's getting too slow. For example, to extract all links from:

...it's spending 34642 milliseconds (34 seconds!!!)

Here is the regex:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

And the code may be something like this:

private void processURL(URL url){
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();            
        while(line!=null){
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
     } catch (Exception e){
     }
 }

Can you give me a Hint?

Extra Data:
1Mbit
Core 2 Duo
1Gb RAM
Single Threaded

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

非要怀念 2024-10-04 04:46:45

提示:不要使用正则表达式进行链接提取或其他 HTML“解析”任务!

您的正则表达式中有 6 (六) 个重复组。执行它需要大量的回溯。在最坏的情况下,它甚至可能接近 O(N^6),其中 N 是输入字符的数量。你可以通过用惰性匹配替换急切匹配来缓解这个问题,但是几乎不可能避免病态的情况;例如,当输入数据的格式严重错误以至于正则表达式不匹配时。

一个更好的解决方案是使用一些现有的严格或宽松的 HTML 解析器。即使手动编写临时解析器也会比使用粗糙的正则表达式更好。

此页面列出了 Java 的各种 HTML 解析器。我听说过有关 TagSoup 和 HtmlCleaner 的好消息。

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6) where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases; e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page that lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.

自由范儿 2024-10-04 04:46:45

你所有的时间,全部,都花在了这里:

 html+=line;

使用 StringBuffer。更好的是,如果可以的话,在每条线上进行匹配,并且根本不要累积它们。

All your time, all of it, is being spent here:

 html+=line;

Use a StringBuffer. Better still, if you can, run the match on every line and don't accumulate them at all.

唯憾梦倾城 2024-10-04 04:46:45

我编写了简单的测试,用于将 1000 万次操作 RegExp 性能与 String.indexof() 进行比较,结果如下:

0.447 seconds
6.174 seconds
13.812080536912752 times regexp longer.

import java.util.regex.Pattern;

public class TestRegExpSpeed {
    public static void main(String[] args) {
        String match = "FeedUserMain_231_Holiday_Feed_MakePresent-1_";
        String unMatch = "FeedUserMain_231_Holiday_Feed_Make2Present-1_";

        long start = System.currentTimeMillis();
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                match.indexOf("MakePresent");
            } else {
                unMatch.indexOf("MakePresent");
            }
        }

        double indexOf = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(indexOf + " seconds");

        start = System.currentTimeMillis();
        Pattern compile = Pattern.compile(".*?MakePresent.*?");
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                compile.matcher(match).matches();
            } else {
                compile.matcher(unMatch).matches();
            }
        }
        double reaexp = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(reaexp + " seconds");

        System.out.println(reaexp / indexOf + " times regexp longer. ");
    }
}

I have written simple test for comparing 10 million operation RegExp performance against String.indexof() with the following result:

0.447 seconds
6.174 seconds
13.812080536912752 times regexp longer.

import java.util.regex.Pattern;

public class TestRegExpSpeed {
    public static void main(String[] args) {
        String match = "FeedUserMain_231_Holiday_Feed_MakePresent-1_";
        String unMatch = "FeedUserMain_231_Holiday_Feed_Make2Present-1_";

        long start = System.currentTimeMillis();
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                match.indexOf("MakePresent");
            } else {
                unMatch.indexOf("MakePresent");
            }
        }

        double indexOf = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(indexOf + " seconds");

        start = System.currentTimeMillis();
        Pattern compile = Pattern.compile(".*?MakePresent.*?");
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                compile.matcher(match).matches();
            } else {
                compile.matcher(unMatch).matches();
            }
        }
        double reaexp = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(reaexp + " seconds");

        System.out.println(reaexp / indexOf + " times regexp longer. ");
    }
}
半寸时光 2024-10-04 04:46:45

请尝试 Jaunt。请不要为此使用正则表达式。

正则表达式的使用与正则表达式的滥用

正则表达式不是解析器。
虽然你可以做一些令人惊奇的事情
带有正则表达式的东西,它们
在平衡标签匹配方面较弱。
一些正则表达式变体具有平衡性
匹配,但这显然是一个黑客行为——
和一个令人讨厌的。你经常可以做到
有点工作,就像我在
日常消毒。但无论怎样
聪明你的正则表达式,不要欺骗
你自己:它绝不是、形状或
代替真实的生活
解析器。

来源

Try Jaunt instead. Please don't use regex for this.

Regex use vs. Regex abuse

Regular expressions are not Parsers.
Although you can do some amazing
things with regular expressions, they
are weak at balanced tag matching.
Some regex variants have balanced
matching, but it is clearly a hack --
and a nasty one. You can often make it
kinda-sorta work, as I have in the
sanitize routine. But no matter how
clever your regex, don't delude
yourself: it is in no way, shape or
form a substitute for a real live
parser.

Source

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文