Java regex very slow (converting nested quantifiers to possessive quantifiers)

Posted on 2024-11-24 02:14:26

I found this regular expression for matching URLs (originally written for JavaScript by Daring Fireball). It works in Java, but in some cases it is extremely slow:

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

According to the question "Java Regular Expression running very slow", the problem is in this block of code:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

It seems that to solve the problem I need to make these inner quantifiers (which are nested) possessive, but I don't know how to do that.
Thanks in advance, and sorry for my bad English!

Comments (2)

安静被遗忘 2024-12-01 02:14:27

You can avoid all of this by using java.net.URL or java.net.URI to parse the URLs.

  1. java.net.URI does a better job of parsing than java.net.URL. Try that one.

  2. Once you've parsed the URL, you can check each of the components; e.g. check that the hostname can be resolved.

  3. If you want URLs that will resolve, you need to distinguish between absolute and non-absolute URLs, and check that the "scheme" is one that you can cope with.

  4. You cannot check that a URL works (i.e. that it corresponds to a retrievable resource) without actually attempting to open the resource. Even that isn't a definitive test, for a number of possible reasons.
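
As a rough illustration of points 1 to 3, here is a minimal sketch that parses a candidate string with java.net.URI and checks that it is absolute, has a host, and uses a scheme the caller can handle. The class name UriCheck, the method name looksLikeWebUrl, and the choice of http/https as the accepted schemes are assumptions made for this example, not part of the answer above. Note that this only validates a string that has already been isolated; it does not find URLs inside free text the way the regex does.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriCheck {
    /**
     * Returns true if the text parses as an absolute URI with a host
     * and an http or https scheme (the accepted schemes are an assumption).
     */
    static boolean looksLikeWebUrl(String text) {
        try {
            URI uri = new URI(text);
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                    && uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            // The parser rejected the string, e.g. because it contains a space.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeWebUrl("http://example.com/a_(b)"));   // true
        System.out.println(looksLikeWebUrl("mailto:someone@example.com")); // false (no host)
        System.out.println(looksLikeWebUrl("not a url"));                  // false (does not parse)
    }
}
```

Resolving the hostname or opening the resource (points 2 and 4) would be separate network steps on top of this check.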

不忘初心 2024-12-01 02:14:27

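You might have a case of catastrophic backtracking: check whether your regex can match the same characters in more than one way (for example, a quantified run such as [^\s()<>]+ inside a group that is itself repeated with +), which causes a runaway number of combinations that must be checked.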

See this article for an explanation.
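
To make this concrete for the pattern in the question, here is a hedged sketch of one possible rewrite. Inside the balanced-parentheses groups the quantifiers become possessive (++ and *+). At the top level, simply writing [^\s()<>]++ would stop the run from handing trailing punctuation back to the final "end with" group, so the sketch instead matches one character per iteration there, which removes the nested quantifier while keeping the outer + able to backtrack one step at a time. The names UrlPatternSketch, URL_CHAR, BALANCED and firstUrl are invented for this example, and the CASE_INSENSITIVE flag is an assumption about how the original pattern was compiled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlPatternSketch {
    // One character that may appear in a URL: not whitespace, parentheses or angle brackets.
    private static final String URL_CHAR = "[^\\s()<>]";

    // Balanced parens, up to 2 levels. The runs (++) and the loop (*+) are possessive:
    // a run can only be followed by '(' or ')', which the class excludes, so giving
    // characters back could never lead to a match.
    private static final String BALANCED =
        "\\((?:" + URL_CHAR + "++|\\(" + URL_CHAR + "++\\))*+\\)";

    static final Pattern URL = Pattern.compile(
        "\\b(" +
          "(?:" +
            "[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])" +   // protocol, then 1-3 slashes or one character
            "|www\\d{0,3}[.]" +                     // "www.", "www1." ... "www999."
            "|[a-z0-9.\\-]+[.][a-z]{2,4}/" +        // looks like a domain name followed by a slash
          ")" +
          // One or more: a single URL character or a balanced group. Each input
          // character can be consumed in only one way, so backtracking stays linear.
          "(?:" + URL_CHAR + "|" + BALANCED + ")+" +
          // End with: a balanced group, or a character that is not trailing punctuation
          // (the typographic quotes are written as Unicode escapes).
          "(?:" + BALANCED +
            "|[^\\s`!\\-()\\[\\]{};:'\".,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019])" +
        ")",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first URL found in the text, or null if there is none. */
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A URL followed by a long run of punctuation, the kind of input
        // that makes the nested-quantifier version backtrack exponentially.
        StringBuilder nasty = new StringBuilder("http://example.com/");
        for (int i = 0; i < 40; i++) {
            nasty.append('!');
        }
        System.out.println(firstUrl("see http://example.com/foo."));  // http://example.com/foo
        System.out.println(firstUrl(nasty.toString()));              // http://example.com/
    }
}
```

One trade-off of the single-character loop: Java's engine can recurse once per iteration for a group like this, so extremely long URLs may call for a different formulation.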
