URL regular expression exclusion

Posted 2024-09-14 00:43:11

Sigh, regex trouble again.

I have the following in $text:

[img]http://www.site.com/logo.jpg[/img]

and 

[url]http://www.site.com[/url]

I have this regex:

$text = preg_replace("/(?<!(\[img\]|\[url\]))([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!(\[\/img\]|\[\/url\]))/","there was link",$text);

The point is to replace a URL only if it's not preceded by [img] or [url] and not followed by [/img] or [/url]. For the previous example, the output I get is:

there was link

and

there was link

The URL pattern and the lookbehind/lookahead patterns each work fine on their own:

$text = "[img]bash.org/logo.jpg[/img]";

$text = preg_replace("/(?<!(\[img\]|\[url\]))bash.org(?!(\[\/img\]|\[\/url\]))/","there was link",$text);

echo $text leaves everything as is and gives me [img]bash.org/logo.jpg[/img]

I suppose the problem is in the combination of the lookarounds and the URL regex. Where's my mistake?

I WANT TO

replace http://www.google.com with "there was link", but leave as is "[url]http://www.google.com[/url]"

I'M GETTING

http://www.google.com replaced with "there was link" and [url]http://www.google.com[/url] replaced with "there was link"

HERE'S PHP CODE TO TEST

<?php

$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com"; 
         // should NOT be changed                  //should be changed    

$text = preg_replace("/(?<!\[url\])([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!\[\/url\])/","there was link",$text);

echo $text;

echo '<hr width="100%">';

$text = ":) :-) 0:) 0:-) :)) :-))";

$text = preg_replace("/(?<!0):-?\)(?!\))/","smiley",$text);

echo $text; // lookarounds work

echo '<hr width="100%">';

$text = "http://stackoverflow.com/questions/2482921/regexp-exclusion";

$text = preg_replace("/([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9]/","it's a link to stackoverflow",$text);

echo $text; // URL pattern works fine

?>


云归处 2024-09-21 00:43:11

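Assuming I understand you correctly, you want to replace every URL in $input with "link was here" unless the URL sits inside url or img BBCode tags. The lookaround assertions don't work because the tags themselves are matched by your very greedy URL pattern (which I'm fairly sure does plenty of things you don't intend). Writing a pattern that matches any valid URL (query strings included) inside other text, without also matching the tags attached to it, isn't necessarily the simplest thing to do, particularly because your current pattern makes http:// or ftp:// optional.

The only way you are likely to succeed is to settle on a strict set of rules for what constitutes a URL.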
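To make the greediness concrete, here is a small sketch using cut-down stand-ins for the pattern in the question (the shortened patterns and the `x.com` sample are illustrative assumptions, not the asker's full regex). It suggests three things: `[http|ftp]` is a character class rather than an alternation, `\S+` can begin the match on the opening tag itself, and the greedy tail can stop one character short of the closing tag so that the lookahead passes.

```php
<?php
// 1. [http|ftp] is a character class: any run of the characters h, t, p, f and | matches.
$classMatches = preg_match('/^[http|ftp]+$/', 'pthf');   // 1
$groupMatches = preg_match('/^(?:http|ftp)$/', 'pthf');  // 0

// 2. \S+ may start on the tag, so the lookbehind is tested before "[url]", not before the URL.
preg_match('/(?<!\[url\])\S+\.com/', '[url]http://www.google.com[/url]', $head);

// 3. The tail backtracks to "[/url", so the lookahead sees "]" rather than "[/url]".
preg_match('/\.com[^<>\n\r ]+[A-Za-z0-9](?!\[\/url\])/', 'x.com[/url]', $tail);

echo $head[0], "\n"; // [url]http://www.google.com
echo $tail[0], "\n"; // .com[/url
```

In both cases the tag ends up inside the match, which is why the assertions placed around the URL pattern never reject anything.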


时光匆匆的小流年 2024-09-21 00:43:11

It is tough to fully understand your question, but it looks like you're doing reverse BBcode. So, leave it alone if it's surrounded by tags? If that is the case, then I think you will have an interesting problem on your hands because URL regexes are notoriously complex.

I think you may be making this more complex than it needs to be. Instead, I would change anything that is between the BBcode. Here's what I think needs to happen:

find the string segment "[url]"
capture anything that follows it
end the capture when the string segment "[/url]" is seen

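That is an easy regex:

$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";

// Pattern written out from the three steps above: "[url]", anything after it (lazily), then "[/url]".
$regex = "#\[url\](.*?)\[/url\]#";
$replace = "there was link";
// PHP has no preg_replace_all(); preg_replace() already replaces every match.
$text = preg_replace($regex, $replace, $text);
echo $text;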

I know this isn't exactly what you asked for (in fact, probably the exact opposite), but it would achieve the same result and be much easier.

You can probably try using negative lookaheads with this regex, but I am not sure it would give you proper results:

$regex = "#(?!\[url\])(.*)(?!\[/url\])#";
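A quick check of that pattern, as a sketch, suggests the doubt is justified: a negative lookahead only rejects one starting position, so the engine moves a single character forward and `.*` then takes the rest of the line, tags included. The sample string is an assumption borrowed from the question.

```php
<?php
$regex = "#(?!\[url\])(.*)(?!\[/url\])#";
preg_match($regex, "[url]http://www.google.com[/url]", $m);

// The match merely skips the first "[" and then runs to the end of the string.
echo $m[0]; // url]http://www.google.com[/url]
```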

One important note: This does not sanitize user input. Make sure you do this, but I would separate the logic so it is very easy to see what you are doing and where you are doing it. I would also use a library to do this because it's easier and probably safer.
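Matching whole tagged segments can also be turned around to get what the question asks for. The sketch below is one possible way, not the answerer's code: an alternation matches either a complete `[url]...[/url]` or `[img]...[/img]` segment or a bare URL, and a callback leaves the tagged segments untouched. The function name `replace_bare_urls` is hypothetical, the URL sub-pattern is deliberately simplistic (scheme required, no `www.`-only links), and well-formed tags are assumed.

```php
<?php
// Hypothetical helper: replaces bare URLs but leaves tagged segments intact.
function replace_bare_urls(string $text, string $replacement): string
{
    // Alternative 1: a whole [url]...[/url] or [img]...[/img] segment.
    // Alternative 2 (group 2): a bare URL with an explicit scheme.
    $pattern = '~\[(url|img)\].*?\[/\1\]|((?:https?|ftp)://[^\s<>\[\]]+)~is';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement): string {
            // Group 2 is only present when the bare-URL alternative matched.
            return isset($m[2]) && $m[2] !== '' ? $replacement : $m[0];
        },
        $text
    );
}

echo replace_bare_urls(
    "[url]http://www.google.com[/url] <br><br> http://www.google.com",
    "there was link"
);
// [url]http://www.google.com[/url] <br><br> there was link
```

Because the tagged segment is consumed as a single match, the URL inside it is never offered to the second alternative, so no lookbehind or lookahead is needed.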


寂寞陪衬 2024-09-21 00:43:11

Final working regexp looks like:

(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])

Example:

<?php

$text = "

[img]http://google.com/logo.jpg[/img]

[img]www.google.com/logo.jpg[/img]

[img]http://www.google.com/logo.jpg[/img]

[url]http://google.com/logo.jpg[/url]

[url]www.google.com/logo.jpg[/url]

[url]http://www.google.com/logo.jpg[/url]

www.google.com/logo.jpg

http://google.com/logo.jpg

http://www.google.com/logo.jpg

";

$text = nl2br($text);


$text = preg_replace("'(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])'i","<font color=\"#ff0000\">link</font>",$text);

echo $text;

?>

outputs:

[img]http://google.com/logo.jpg[/img]

[img]www.google.com/logo.jpg[/img]

[img]http://www.google.com/logo.jpg[/img]

[url]http://google.com/logo.jpg[/url]

[url]www.google.com/logo.jpg[/url]

[url]http://www.google.com/logo.jpg[/url]

link

link

link

The trick is to replace only links that start at the beginning of the string (^) or right after whitespace (\s). I found no other way to solve this issue.
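Why the anchor matters can be seen in a reduced sketch (the two shortened patterns below are illustrative assumptions, not the full regex above). Without the `(^|\s)` requirement, a failed lookbehind just makes the engine retry one character later, where the lookbehind no longer sees the complete tag.

```php
<?php
$tagged = '[url]http://www.google.com[/url]';

// No anchor: the attempt at "h" is rejected, but the next one at "ttp://" succeeds.
preg_match('~(?<!\[url\])(?:[\w-]+://|www[.])\S+~', $tagged, $loose);
echo $loose[0], "\n"; // ttp://www.google.com[/url]

// With the anchor, a match may only begin at the start of the string or after whitespace.
$anchored = preg_match('~(?<!\[url\])(?:^|\s)(?:[\w-]+://|www[.])\S+~', $tagged);
echo $anchored, "\n"; // 0
```

One side effect to keep in mind: the whitespace matched by `\s` becomes part of the match, so it is replaced together with the link.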


放血 2024-09-21 00:43:11

Where's my mistake?

Well, the worst mistake is the lookbehind. It isn't needed, and it's making the job much harder than it needs to be. Assuming the existing tags are well formed, you needn't bother looking for the opening tag; its presence is implied by the presence of the closing tag.

EDIT: Your regex has several other problems besides the lookbehind, but it didn't seem worthwhile to try and fix it. Instead, I grabbed a regex from RegexBuddy's built-in library of useful regexes, and added the lookahead to it.

Try this regex (or see it in action on ideone):

'_\b(?>
     (?>www\.|ftp\.|(?:https?|ftp|file)://)  # scheme or subdomain
     [-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w]     # everything else
   )(?!\[/(?:img|url)\])
 _x'

Just because a problem can be described in terms of looking forward or backward, preceding or following, etc., doesn't mean you should design the regex that way. Lookbehind in particular should never be the first tool you reach for.
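For reference, a minimal usage sketch of the regex above with preg_replace, assuming the test string from the question and well-formed tags (the variable names are illustrative):

```php
<?php
$pattern = '_\b(?>
     (?>www\.|ftp\.|(?:https?|ftp|file)://)  # scheme or subdomain
     [-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w]     # everything else
   )(?!\[/(?:img|url)\])
 _x';

$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";

// Only the URL that is not followed by a closing tag is replaced.
$result = preg_replace($pattern, "there was link", $text);
echo $result; // [url]http://www.google.com[/url] <br><br> there was link
```

The atomic group `(?>...)` is what keeps the lookahead honest: once the URL has been matched in full, the engine cannot give back a trailing character (for example, shrinking to `http://www.google.co`) just to make the lookahead succeed.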
