Regex to match a URL that is not followed by " or <

Posted 2024-11-01 00:14:18

I'm trying to modify the URL-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls so that it does not match anything that is already part of a valid link tag or used as the link text.

For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com

www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>

I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.

I can't see what I'm doing wrong... any ideas?

\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])

Here is a simpler example too:

((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
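The symptom can be reproduced with a short script. This is a minimal sketch in Python rather than the Perl used later in the thread; the function name `find_urls` is made up for illustration. The pattern is the simpler one above, unchanged. When the lookahead fails after `.com`, the greedy character class gives back one character, the lookahead then sees `m` instead of `"` or `<`, and the shortened match is accepted.

```python
import re

# The simpler pattern from the question, copied unchanged,
# with the negative lookahead at the end.
PATTERN = re.compile(
    r'((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])'
)

SAMPLE = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'


def find_urls(text):
    """Return the full text of every match, in order of appearance."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    # The two linked URLs come back with the final "m" dropped,
    # because the engine backtracks by one character.
    print(find_urls(SAMPLE))
    # ['www.foo.com', 'http://www.bar.co', 'http://www.baz.co']
```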

冷清清 2024-11-08 00:14:18

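I looked into this problem last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). That link is a test page for the JavaScript solution, with many examples of difficult-to-linkify URLs.

My regex solution is written for both PHP and JavaScript, and it is not simple (but, as it turns out, neither is the problem). For more information, I would also recommend reading:

The Problem With URLs, by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs, by John Gruber

The comments following Jeff's blog post are a must-read if you want to do this right...

Note also that John Gruber's regex has a component that can go into the realm of catastrophic backtracking (the part that matches one level of balanced parentheses).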

江挽川 2024-11-08 00:14:18

Yeah, it's actually trivial to make this work if you just want to exclude trailing characters: make your expression 'independent' (an atomic group), and no backtracking will occur in that segment.

(?>\b ...)(?!["<])

A Perl test:

use strict;
use warnings;

my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';

while ($str =~ m~
 (?>
    \b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
 )
 (?!["<])
~xg)
{
   print "$1\n";
}

Output:

www.foo.com
http://www.some.com
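For engines without `(?>...)`, the same effect can be approximated with a lookahead plus a backreference. The sketch below is in Python, where native atomic groups only arrived in version 3.11. The `URL` pattern is a simplified stand-in for the full expression, and the names `ATOMIC` and `find_unlinked_urls` are made up for illustration. A lookahead is not re-entered once it has matched, so `(?=(...))\1` consumes the longest URL or nothing at all.

```python
import re

# Simplified URL body, assumed for illustration only
# (not the full pattern used in the Perl test above).
URL = r'(?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]+'

# (?=(URL)) captures the longest URL at this position without consuming it;
# \1 then consumes exactly that text, so no shorter match can be retried.
ATOMIC = re.compile(r'(?=(' + URL + r'))\1(?!["<])')

SAMPLE = ('www.foo.com <a href="http://www.bar.com">'
          'http://www.baz.com</a>http://www.some.com')


def find_unlinked_urls(text):
    """Return URLs that are not immediately followed by a quote or '<'."""
    return [m.group(1) for m in ATOMIC.finditer(text)]


if __name__ == "__main__":
    print(find_unlinked_urls(SAMPLE))
    # ['www.foo.com', 'http://www.some.com']
```

On Python 3.11 or later, `(?>...)` can be written directly, as in the Perl version.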
