Java 正则表达式非常慢（将嵌套量词转换为所有格量词）

发布于 2024-11-24 02:14:26 字数 1993 浏览 4 评论 0原文

我发现这个正则表达式可以匹配url（最初由Daring Fireball在Javascript中使用），它在java中可以工作，但在某些情况下非常慢：

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

并且我发现了主题： Java 正则表达式运行速度非常慢问题出在这段代码中：

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

似乎要解决这个问题，我需要使这些内部量词具有所有格（实际上是嵌套的），但我不知道该怎么做感谢您的建议，并对我的英语不好表示歉意！

原文

I've found this regular expression to match urls (originally in Javascript by Daring Fireball) which in java works but in some cases is extremly slow:

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

and i've found on topic: Java Regular Expression running very slow that the problem is in this block of code:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

and it seems that to solve the problem i need to make these inner quantifiers possessive (which actually are nested), but i don't know how to do that
Thanks in advice and sorry for my BAD english!

分享到QQ

分享到微博