Regex to match URLs that are not followed by " or <

Posted on 2024-11-01 00:14:18

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.

For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com

www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>

I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.

I can't see what I'm doing wrong... any ideas?

\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])

Here is a simpler example too:

((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
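
For reference, the behaviour described above can be reproduced with a minimal sketch. It is a Python port written for illustration, not part of the original post, and the names `SIMPLE`, `text` and `matches` are made up. The pattern is the simpler one, copied unchanged. When the lookahead fails right after `.com`, the greedy `+` gives back one character, so the lookahead is retried in front of the `m`, where it succeeds.

```python
import re

# The simpler pattern from the question, copied as-is
# (including the unescaped "." in "www.").
SIMPLE = re.compile(r'((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])')

text = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'

# group(0) is the whole match. For the two URLs inside the tag, the greedy
# "+" backtracks by one character so that the negative lookahead is checked
# before "m" rather than before '"' or "<".
matches = [m.group(0) for m in SIMPLE.finditer(text)]
print(matches)
# → ['www.foo.com', 'http://www.bar.co', 'http://www.baz.co']
```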

Comments (2)

冷清清 2024-11-08 00:14:18

I looked into this issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). This link is a test page for the JavaScript solution, with many examples of difficult-to-linkify URLs.

My regex solution, written for both PHP and JavaScript, is not simple (but neither is the problem, as it turns out). For more information I would also recommend reading:

The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber

The comments following Jeff's blog post are a must read if you want to do this right...

Note also that John Gruber's regex has a component that can run into catastrophic backtracking (the part which matches one level of matching parentheses).

江挽川 2024-11-08 00:14:18

Yeah, it's actually trivial to make it work if you just want to exclude trailing characters: make your expression 'independent' (an atomic group), so that no backtracking into that segment can occur once it has matched.

(?>\b ...)(?!["<])

A perl test:

use strict;
use warnings;

my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';

while ($str =~ m~
 (?>
    \b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
 )
 (?!["<])
~xg)
{
   print "$1\n";
}

Output:

www.foo.com
http://www.some.com
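
The same idea can be sketched in Python. This is an illustrative port, not the answerer's code. It uses the simpler pattern from the question (with the dot in `www.` escaped, which is my change) and swaps the atomic group for the portable `(?=(...))\1` emulation: a lookahead is never re-entered once it has succeeded, so capturing inside it and then consuming the capture with a backreference behaves like `(?>...)`. Python 3.11 and later also accept `(?>...)` directly. The names `URL_BODY`, `INDEPENDENT`, `text` and `matches` are hypothetical.

```python
import re

# URL body based on the question's simpler pattern; escaping the dot in
# "www." is an assumption about what the asker intended.
URL_BODY = r'(?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]+'

# Emulated atomic group: the lookahead captures the longest URL body,
# \1 consumes exactly that text, and the engine cannot shorten it when
# the trailing negative lookahead fails.
INDEPENDENT = re.compile(r'(?=(' + URL_BODY + r'))\1(?!["<])')

text = ('www.foo.com <a href="http://www.bar.com">'
        'http://www.baz.com</a>http://www.some.com')

matches = [m.group(0) for m in INDEPENDENT.finditer(text)]
print(matches)
# → ['www.foo.com', 'http://www.some.com']
```

Because the URL body can no longer give characters back, a candidate that ends right before `"` or `<` is rejected as a whole instead of being trimmed to `...co`.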
