PHP regex negative-lookbehind

Negative lookbehind and greedy quantifiers in PHP

Posted on 2024-09-27 06:24:30

I'm using a regex to find any URLs and link them accordingly. However, I do not want to linkify any URLs that are already linked, so I'm using a lookbehind to see if the URL has an href before it.
This fails, though, because variable-length quantifiers aren't allowed in a lookbehind in PHP (PCRE).

Here's the regex for the match:

/\b(?<!href\s*=\s*[\'\"])((?:http:\/\/|www\.)\S*?)(?=\s|$)/i
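A minimal way to reproduce the failure, assuming a standard PHP build with its bundled PCRE library (the variable names and sample text are made up for illustration). Because the lookbehind contains `\s*`, the pattern should be rejected at compile time, so `preg_match()` emits a warning and returns `false` rather than 0 or 1:

```php
<?php
// The pattern from the question, unchanged. The lookbehind contains \s*,
// which has no fixed length, so PCRE refuses to compile it.
$pattern = '/\b(?<!href\s*=\s*[\'\"])((?:http:\/\/|www\.)\S*?)(?=\s|$)/i';

// @ suppresses the compilation warning so only the return value is inspected.
$result = @preg_match($pattern, 'see http://example.com now');

// false signals an error in the pattern, not "no match" (which would be 0).
var_dump($result);
```

Checking for `=== false` matters here, since a loose comparison cannot tell a broken pattern from a pattern that simply found nothing.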

What's the best way around this problem?

EDIT:

I have yet to test it, but I think the trick to doing it in a single regex is using conditional expressions within the regex, which is supported by PCRE. It would look something like this:

/(href\s*=\s*[\'\"])?(?(1)^|)((?:http:\/\/|www\.)\w[\w\d\.\/]*)(?=\s|$)/i

The key point is that if the href is captured, the match is immediately thrown out due to the conditional (?(1)^|), which is guaranteed to not match.
There's probably something wrong with it. I'll test it out tomorrow.
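A quick check with a made-up input suggests where this goes wrong. The conditional does reject the attempt that starts at `href`, but the engine then retries from later positions, where the optional group 1 is simply skipped, and the URL itself becomes a valid starting point. A sketch, with the sample string being an assumption:

```php
<?php
// The conditional pattern from the edit, unchanged.
$pattern = '/(href\s*=\s*[\'\"])?(?(1)^|)((?:http:\/\/|www\.)\w[\w\d\.\/]*)(?=\s|$)/i';

// Hypothetical input: a URL inside an href value, followed by whitespace.
$text = 'href="http://a.com x';

// Attempt at offset 0: group 1 captures href=" and the conditional demands ^, which fails.
// Attempt at offset 6: group 1 is skipped, the conditional is empty, and the URL matches.
$found = preg_match($pattern, $text, $m);

var_dump($found);
var_dump($m[2] ?? null);
```

So for this input the URL after `href="` is still matched. In typical markup the closing quote right after the URL makes the trailing `(?=\s|$)` fail, which can hide the problem.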

高冷爸爸 2024-10-04 06:24:30

I tried doing the same thing the other way round: ensure that the URL doesn't end in ">:

/((?:http:\/\/|www\.)(?:[^"\s]|"[^>]|(*FAIL))*?)(?=\s|$)/i

But for me that looks pretty hacky, I'm sure you can do better.

My second approach is more similar to yours (and thus is more precise):

/href\s*=\s*"[^"]*"(*SKIP)(*FAIL)|((?:http:\/\/|www\.)\S*?)(?=\s|$)/i

If I find an href=, I (*SKIP)(*FAIL). This means the current attempt fails and the next one starts at the position the regex engine had reached when it encountered the (*SKIP), so the href value is never scanned for URLs.

But that's no less hacky and I'm sure there is a better alternative.
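To show how the second pattern might be used for the actual replacement, here is a minimal sketch. The helper name `linkify`, the replacement template, and the sample input are assumptions for illustration, not part of the answer:

```php
<?php
// Hypothetical helper: wrap bare URLs in <a> tags, leaving existing href values alone.
function linkify(string $html): string
{
    // Left alternative: consume href="..." and discard it with (*SKIP)(*FAIL),
    // so the next attempt starts after the closing quote.
    // Right alternative: the URL pattern from the question, captured in group 1.
    $pattern = '/href\s*=\s*"[^"]*"(*SKIP)(*FAIL)|((?:http:\/\/|www\.)\S*?)(?=\s|$)/i';

    // preg_replace() returns null on error; fall back to the unchanged input.
    return preg_replace($pattern, '<a href="$1">$1</a>', $html) ?? $html;
}

echo linkify('<a href="http://a.com">A</a> and http://b.com here');
// prints: <a href="http://a.com">A</a> and <a href="http://b.com">http://b.com</a> here
```

As written, the pattern only skips double-quoted href values. A link whose visible text is itself a URL would still be matched by the right alternative.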

莫多说 2024-10-04 06:24:30

Finding "every URL that isn't part of a link" is quite difficult negative logic. It may be easier to find every URL, then every URL that's a link, and remove each of the latter from the former list.

As far as finding which URLs are a part of a link, try:

/<a([\s]+[\w="]+)*[\s]+href[\s]*=[\s]*"([\w\s:/.?+&=]+)"([\s]+[\w="]+)*>/i

I tested it with http://regexpal.com/ to be sure. It looks for the <a first, then it allows for any number of parameters, followed by href, followed by any other number of parameters. If it doesn't have the href, it's not a link. If it isn't an <a> tag, it's not a link. Since this is just the list of what we want to remove from the other list (of URLs), I simplified the definition of a URL to [\w\s:/.?+&=]+. As far as generating a list of URLs, you'll want something smarter.
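The subtraction described above could be sketched roughly as follows. The function name `unlinkedUrls`, the simple URL pattern, and the sample input are assumptions for illustration. The link pattern is the one from this answer, written with `~` delimiters so the `/` inside the character class needs no escaping in PHP:

```php
<?php
// Hypothetical helper: URLs that occur in $html but not as an href value.
function unlinkedUrls(string $html): array
{
    // Deliberately simple URL pattern (an assumption, not a full URL grammar).
    $urlPattern = '~(?:http://|www\.)[^\s"<>]+~i';
    // Link pattern from the answer; group 2 holds the href value.
    $linkPattern = '~<a([\s]+[\w="]+)*[\s]+href[\s]*=[\s]*"([\w\s:/.?+&=]+)"([\s]+[\w="]+)*>~i';

    preg_match_all($urlPattern, $html, $all);
    preg_match_all($linkPattern, $html, $linked);

    // Remove every linked URL from the list of all URLs, then reindex.
    return array_values(array_diff($all[0], $linked[2]));
}

print_r(unlinkedUrls('<a href="http://a.com/x">A</a> and http://b.com here'));
```

Note that `array_diff()` compares by value, so a URL that is linked once is also dropped wherever it appears unlinked. It yields a list of URLs rather than their positions in the text.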

你的笑 2024-10-04 06:24:30

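I don't have a better regex. But if you can't find one, I'd suggest doing the task in two passes: first find and remove all the links, then search for URLs. That will probably be easier and faster.
(For a one-pass find and replace, you could use something like http://www.satya-weblog.com/2010/08/php-regex-find-and-replace-any-word-string-or-text-at-one-go.html.)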
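A minimal sketch of the two-pass idea, assuming the goal is just to list the URLs that are not already inside a link. The helper name `urlsOutsideLinks`, both patterns, and the sample input are assumptions for illustration:

```php
<?php
// Hypothetical helper: drop whole <a>...</a> elements, then collect the remaining URLs.
function urlsOutsideLinks(string $html): array
{
    // Pass 1: remove anchor elements (s lets . span newlines; .*? stops at the first </a>).
    // preg_replace() returns null on error; fall back to the unchanged input.
    $withoutLinks = preg_replace('~<a\b[^>]*>.*?</a>~is', '', $html) ?? $html;

    // Pass 2: a deliberately simple URL pattern (an assumption, not a full URL grammar).
    preg_match_all('~(?:http://|www\.)[^\s"<>]+~i', $withoutLinks, $matches);

    return $matches[0];
}

print_r(urlsOutsideLinks('<a href="http://a.com">http://a.com</a> and http://b.com here'));
```

Because the whole anchor element is removed first, a link whose visible text is itself a URL is excluded as well. The links are removed only from a working copy, so the original text stays available for the actual replacement.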