Regex headache with various links and href delimiters (" and ')

Posted 2024-10-01 01:59:22

So, I want to match the following link structures with preg_match_all in PHP:

<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>

I can get the " and ' delimited URLs with:

'#<a[^>]*?href=("|\')(.*?)("|\')#is'

or I can get all three kinds, but not when the first two contain spaces, with:

'#<a[^>]*?href=("|\')?(.*?)[\s\"\'>]#is'

How can I formulate this so that it picks up " and ' delimited URLs that may contain spaces, but also properly encoded URLs without delimiters?

初熏 2024-10-08 01:59:22

OK, this seems to work:

'#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is'

($matches[1] contains the URLs)

The only annoyance is that quoted URLs still have their quotes, so you'll have to strip them off:

$first = substr($match, 0, 1);
if($first == '"' || $first == "'")
    $match = substr($match, 1, -1);
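
For reference, the pattern and the quote stripping above can be combined into one short script. This is a minimal sketch that assumes the four sample tags from the question as input; the `$html` and `$urls` variable names are illustrative and not from the original answer.

```php
<?php
// Sample input taken from the question (nowdoc, so nothing is interpolated).
$html = <<<'END'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

// Either a quoted value (quotes included in group 1) or an unquoted run.
$pattern = '#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is';
preg_match_all($pattern, $html, $matches);

$urls = array();
foreach ($matches[1] as $match) {
    // Strip the surrounding quotes when the value was quoted.
    $first = substr($match, 0, 1);
    if ($first == '"' || $first == "'") {
        $match = substr($match, 1, -1);
    }
    $urls[] = $match;
}

print_r($urls);
```

Note that the quoted branch uses `[^\'"]+`, so a double-quoted URL that itself contains a single quote (or the reverse) would not match with this pattern.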

霓裳挽歌倾城醉 2024-10-08 01:59:22

EDIT: I have edited this to work a little better than I originally posted.

You almost have it in the second regex:

'#<a[^>]*?href=("|\')?(.*?)[\\1|>]#is'

Returns the following array:

array(3) {
  [0]=>
  array(4) {
    [0]=>
    string(92) "<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>"
    [1]=>
    string(101) "<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>"
    [2]=>
    string(94) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>"
    [3]=>
    string(77) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>"
  }
  [1]=>
  array(4) {
    [0]=>
    string(1) """
    [1]=>
    string(1) "'"
    [2]=>
    string(0) ""
    [3]=>
    string(0) ""
  }
  [2]=>
  array(4) {
    [0]=>
    string(74) "http://this.is.a.link.com/?query=this has invalid spaces" possible garbage"
    [1]=>
    string(83) "http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage"
    [2]=>
    string(77) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage"
    [3]=>
    string(60) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters"
  }
}

Works with or without delimiters.
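
One thing worth checking before relying on this pattern: in PCRE, `\1` inside square brackets is read as an octal character escape rather than a backreference, so `[\\1|>]` behaves as "the character 0x01, a literal |, or >". That is consistent with the dump above, where group 2 runs on past the closing quote to the end of the tag. The sketch below reproduces that behavior and then tries one possible alternative based on a PCRE conditional subpattern. The second pattern and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample input taken from the question.
$html = <<<'END'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

// Pattern from the answer: [\1|>] is a character class, so the lazy group
// only stops at ">" (or at a literal "|" or the byte 0x01).
$original = '#<a[^>]*?href=("|\')?(.*?)[\\1|>]#is';
preg_match_all($original, $html, $m);
echo $m[2][0], "\n"; // still carries the closing quote and trailing attributes

// Hypothetical alternative: if group 1 (the opening quote) took part in the
// match, require the same quote to close; otherwise stop at whitespace or ">".
$conditional = '#<a[^>]*?href=(["\'])?(.*?)(?(1)\1|[\s>])#is';
preg_match_all($conditional, $html, $n);
print_r($n[2]);
```

With the conditional version, group 2 holds only the URL in all four cases, so no quote stripping is needed afterwards.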

静谧 2024-10-08 01:59:22

Use a DOM parser. You cannot parse (x)HTML with regular expressions.

$html = <<<END
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($html);
libxml_use_internal_errors(false);

$items = $domd->getElementsByTagName("a");
foreach ($items as $item) {
  var_dump($item->getAttribute("href"));
}

嗫嚅 2024-10-08 01:59:22

When you say you want to match them, are you trying to extract information out of the links, or simply find hyperlinks with an href? If you're after only the latter, this should work just fine:

/<a[^>]*href=[^\s].*?>/

帅哥哥的热头脑 2024-10-08 01:59:22

As @JasonWoof indicated, you need to use an embedded alternation: one alternative for quoted URLs, one for non-quoted. I also recommend using a capturing group to determine which kind of quote is being used, as @DanHorrigan did. With the addition of a negative lookahead ((?!\\2)) and possessive quantifiers (*+), you can create a highly robust regex that is also very quick:

~
<a\\s+[^>]*?\\bhref=
(
  (["'])          # capture the opening quote
  (?:(?!\\2).)*+  # anything else, zero or more times
  \\2             # match the closing quote
|
  [^\\s>]*+   # anything but whitespace or closing brackets
)
~ix

See it in action on ideone. (The doubled backslashes are because the regex is written in the form of a PHP heredoc. I'd prefer to use a nowdoc, but ideone is apparently still running PHP 5.2.)
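
For readers on PHP 5.3 or later, here is a rough sketch of the same regex written as a nowdoc, so the backslashes are single, and run through preg_match_all. The sample input is the four tags from the question; the `$html`, `$regex`, and `$urls` names and the quote-stripping step are illustrative assumptions rather than part of the answer.

```php
<?php
// Sample input taken from the question.
$html = <<<'END'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

// Nowdoc: no escape processing, so each backslash is written once.
$regex = <<<'REGEX'
~
<a\s+[^>]*?\bhref=
(
  (["'])          # capture the opening quote
  (?:(?!\2).)*+   # anything else, zero or more times
  \2              # match the closing quote
|
  [^\s>]*+        # anything but whitespace or closing brackets
)
~ix
REGEX;

$urls = array();
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // Group 2 is only present when the value was quoted; drop that quote pair.
    $urls[] = empty($m[2]) ? $m[1] : substr($m[1], 1, -1);
}

print_r($urls);
```

Group 1 still includes the surrounding quotes for quoted values, which is why the loop checks group 2 before trimming.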
