Java正则表达式性能

发布于 2024-09-27 04:46:45 字数 1371 浏览 5 评论 0原文

我正在尝试使用 Java 解析正则表达式的链接。

但我认为它变得太慢了。例如，要从以下位置提取所有链接：

http://news.google。 com.ar/nwshp?hl=es&tab=wn

...它花费了 34642 毫秒（34 秒！！！）

这是正则表达式：

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

模式的标志：

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

代码可能是这样的:

private void processURL(URL url){
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();            
        while(line!=null){
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
     } catch (Exception e){
     }
 }

能给个提示吗？

额外数据：
1Mbit
酷睿 2 双核
1Gb 内存
单线程

原文

I'm trying to parse links with regex with Java.

But I think it's getting too slow. For example, to extract all links from:

http://news.google.com.ar/nwshp?hl=es&tab=wn

...it's spending 34642 milliseconds (34 seconds!!!)

Here is the regex:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

And the code may be something like this:

private void processURL(URL url){
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();            
        while(line!=null){
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
     } catch (Exception e){
     }
 }

Can you give me a Hint?

Extra Data:
1Mbit
Core 2 Duo
1Gb RAM
Single Threaded

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

非要怀念 2024-10-04 04:46:45

提示：不要使用正则表达式进行链接提取或其他 HTML“解析”任务！

您的正则表达式中有 6 (六) 个重复组。执行它需要大量的回溯。在最坏的情况下，它甚至可能接近 O(N^6)，其中 N 是输入字符的数量。你可以通过用惰性匹配替换急切匹配来缓解这个问题，但是几乎不可能避免病态的情况；例如，当输入数据的格式严重错误以至于正则表达式不匹配时。

一个更好的解决方案是使用一些现有的严格或宽松的 HTML 解析器。即使手动编写临时解析器也会比使用粗糙的正则表达式更好。

此页面列出了 Java 的各种 HTML 解析器。我听说过有关 TagSoup 和 HtmlCleaner 的好消息。

回复收藏 0 原文

自由范儿 2024-10-04 04:46:45

你所有的时间，全部，都花在了这里：

 html+=line;

使用 StringBuffer。更好的是，如果可以的话，在每条线上进行匹配，并且根本不要累积它们。

All your time, all of it, is being spent here:

 html+=line;

Use a StringBuffer. Better still, if you can, run the match on every line and don't accumulate them at all.

回复收藏 0 原文

唯憾梦倾城 2024-10-04 04:46:45

我编写了简单的测试，用于将 1000 万次操作 RegExp 性能与 String.indexof() 进行比较，结果如下：

0.447 seconds
6.174 seconds
13.812080536912752 times regexp longer.

import java.util.regex.Pattern;

public class TestRegExpSpeed {
    public static void main(String[] args) {
        String match = "FeedUserMain_231_Holiday_Feed_MakePresent-1_";
        String unMatch = "FeedUserMain_231_Holiday_Feed_Make2Present-1_";

        long start = System.currentTimeMillis();
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                match.indexOf("MakePresent");
            } else {
                unMatch.indexOf("MakePresent");
            }
        }

        double indexOf = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(indexOf + " seconds");

        start = System.currentTimeMillis();
        Pattern compile = Pattern.compile(".*?MakePresent.*?");
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                compile.matcher(match).matches();
            } else {
                compile.matcher(unMatch).matches();
            }
        }
        double reaexp = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(reaexp + " seconds");

        System.out.println(reaexp / indexOf + " times regexp longer. ");
    }
}

I have written simple test for comparing 10 million operation RegExp performance against String.indexof() with the following result:

0.447 seconds
6.174 seconds
13.812080536912752 times regexp longer.

import java.util.regex.Pattern;

public class TestRegExpSpeed {
    public static void main(String[] args) {
        String match = "FeedUserMain_231_Holiday_Feed_MakePresent-1_";
        String unMatch = "FeedUserMain_231_Holiday_Feed_Make2Present-1_";

        long start = System.currentTimeMillis();
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                match.indexOf("MakePresent");
            } else {
                unMatch.indexOf("MakePresent");
            }
        }

        double indexOf = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(indexOf + " seconds");

        start = System.currentTimeMillis();
        Pattern compile = Pattern.compile(".*?MakePresent.*?");
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                compile.matcher(match).matches();
            } else {
                compile.matcher(unMatch).matches();
            }
        }
        double reaexp = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(reaexp + " seconds");

        System.out.println(reaexp / indexOf + " times regexp longer. ");
    }
}

回复收藏 0 原文

半寸时光 2024-10-04 04:46:45

请尝试 Jaunt。请不要为此使用正则表达式。

正则表达式的使用与正则表达式的滥用

正则表达式不是解析器。
虽然你可以做一些令人惊奇的事情
带有正则表达式的东西，它们
在平衡标签匹配方面较弱。
一些正则表达式变体具有平衡性
匹配，但这显然是一个黑客行为——
和一个令人讨厌的。你经常可以做到
有点工作，就像我在
日常消毒。但无论怎样
聪明你的正则表达式，不要欺骗
你自己：它绝不是、形状或
代替真实的生活
解析器。

来源