Java regex extremely slow (converting nested quantifiers to possessive quantifiers)
I've found this regular expression for matching URLs (originally written in JavaScript by Daring Fireball). It works in Java, but in some cases it is extremely slow:
private final static String pattern =
"\\b" +
"(" + // Capture 1: entire matched URL
"(?:" +
"[a-z][\\w-]+:" + // URL protocol and colon
"(?:" +
"/{1,3}" + // 1-3 slashes
"|" + // or
"[a-z0-9%]" + // Single letter or digit or '%'
// (Trying not to match e.g. "URI::Escape")
")" +
"|" + // or
"www\\d{0,3}[.]" + // "www.", "www1.", "www2." … "www999."
"|" + // or
"[a-z0-9.\\-]+[.][a-z]{2,4}/" + // looks like domain name followed by a slash
")" +
"(?:" + // One or more:
"[^\\s()<>]+" + // Run of non-space, non-()<>
"|" + // or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
")+" +
"(?:" + // End with:
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
"|" + // or
"[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" + // not a space or one of these punctuation chars (updated to add a dash)
")" +
")";
I found in the topic "Java Regular Expression running very slow" that the problem is in this block of code:
"(?:" + // One or more:
"[^\\s()<>]+" + // Run of non-space, non-()<>
"|" + // or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
")+"
It seems that to solve the problem I need to make these inner quantifiers (which are actually nested) possessive, but I don't know how to do that.
Thanks in advance, and sorry for my bad English!
Comments (2)
You can avoid all of this by using java.net.URL or java.net.URI to parse the URLs. java.net.URI does a better job of parsing than java.net.URL, so try that one. Once you've parsed the URL, you can check each of the components; e.g. check that the hostname can be resolved.
If you want URLs that will resolve, you need to distinguish between absolute and non-absolute URLs, and check that the "scheme" is one that you can cope with.
You cannot check that a URL works (i.e. that it corresponds to a retrievable resource) without actually attempting to open the resource. And even that isn't a definitive test, for a number of possible reasons.
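As a rough illustration of the parsing approach described above, here is a minimal sketch using java.net.URI. The class name UriCheck, the method name isAbsoluteHttpUrl, and the choice to accept only http and https are assumptions made for this example. It checks syntax and components only; it does not resolve the hostname or open the resource.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriCheck {
    // Hypothetical helper: true if the candidate parses as an absolute
    // URI with a host and a scheme this sketch chooses to handle.
    static boolean isAbsoluteHttpUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                && uri.getHost() != null
                && ("http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all (e.g. it contains a space).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteHttpUrl("http://example.com/path?q=1")); // true
        System.out.println(isAbsoluteHttpUrl("www.example.com"));             // false: no scheme
        System.out.println(isAbsoluteHttpUrl("mailto:someone@example.com"));  // false: no host
    }
}
```

Note that this validates a string you already hold; it does not find URLs inside free text, which is what the regex in the question is doing.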
You might have a case of catastrophic backtracking: Check that your regex doesn't match the same characters in multiple groups, causing a runaway number of combinations that must be checked.
See this article for an explanation.
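For the pattern in the question, one possible way to remove the ambiguity is sketched below. It is not taken from the linked article, and the class and method names (PossessiveUrlPattern, firstUrl) are made up for this example. Inside the balanced parentheses the runs and the loop are made possessive (`++`, `*+`), which is safe because a run can never contain ")". The top-level run is handled differently: making `[^\s()<>]+` possessive there would stop the final "End with" group from taking back the last character, so the sketch drops the inner `+` instead and leaves the outer `+` as the only quantifier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveUrlPattern {
    // Balanced parens, up to 2 levels. Possessive runs and loop: nothing
    // consumed here could ever match the closing "\\)", so backtracking
    // into it is pointless.
    static final String PARENS =
        "\\((?:[^\\s()<>]++|\\([^\\s()<>]++\\))*+\\)";

    static final Pattern URL = Pattern.compile(
        "\\b(" +
        "(?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])" + // protocol and colon
        "|www\\d{0,3}[.]" +                      // "www.", "www1.", ...
        "|[a-z0-9.\\-]+[.][a-z]{2,4}/)" +        // domain name and slash
        // One or more single characters or paren groups. No nested "+",
        // so each input has exactly one way to be split up.
        "(?:[^\\s()<>]|" + PARENS + ")+" +
        // End with balanced parens or a non-punctuation character.
        // The \\u escapes are the same quote characters as in the question.
        "(?:" + PARENS +
        "|[^\\s`!\\-()\\[\\]{};:'\".,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019])" +
        ")");

    // Hypothetical helper: first URL found in the text, or null.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints http://example.com/foo_(bar)
        System.out.println(firstUrl("see http://example.com/foo_(bar). ok"));
        // prints www.example.com
        System.out.println(firstUrl("www.example.com."));

        // An unclosed paren followed by a long run: the kind of input that
        // makes the nested-quantifier version try a huge number of splits.
        StringBuilder hostile = new StringBuilder("www.(");
        for (int i = 0; i < 200; i++) hostile.append('a');
        System.out.println(firstUrl(hostile.toString())); // prints null
    }
}
```

One caveat worth testing before relying on this: a repeated group that contains an alternation may be matched recursively by Java's engine, so very long URLs could run into stack depth limits.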