Extracting URLs from links in an HTML document using regular expressions

Posted on 2024-11-16 16:50:00

I need to capture all links in a given HTML document.

Here is a sample of the markup:

<div class="infobar">
    ... some code goes here ...
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
    ... some code goes here ...
</div>

I need to get all links inside div.infobar that start with /link/

I tried this:

preg_match_all('#<div class="infobar">.*?(href="/link/(.*?)") .*?</div>#is', $raw, $x);

but it only gives me the first match.

Thanks for any advice.

挽手叙旧 2024-11-23 16:50:00

I would suggest using DOMDocument for this rather than a regex. Consider the following simple code:

$content = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content);

// To hold all your links...
$links = array();

// Get all divs
$divs = $dom->getElementsByTagName("div");
foreach($divs as $div) {
  // Check the class attr of each div
  $cl = $div->getAttribute("class");
  if ($cl == "infobar") {
    // Collect the href of every <a> in this div that starts with /link/
    $anchors = $div->getElementsByTagName("a");
    foreach ($anchors as $a) {
      $href = $a->getAttribute("href");
      if (strpos($href, "/link/") === 0) {
        $links[] = $href;
      }
    }
  }
}
var_dump($links);

OUTPUT

array(4) {
  [0]=>
  string(15) "/link/some-text"
  [1]=>
  string(18) "/link/another-text"
  [2]=>
  string(12) "/link/blabla"
  [3]=>
  string(13) "/link/whassup"
}
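
As a possible variation on the loop above, the same selection can be written as a single XPath query with DOMXPath. This is only a sketch: the function name `extractInfobarLinks` is made up for illustration, and matching `infobar` as one of several space-separated class names is an assumption that goes beyond the exact comparison used above.

```php
<?php
// Hypothetical helper (not part of the answer above): collect the hrefs of
// links under div.infobar that start with /link/, using one XPath query.
function extractInfobarLinks(string $html): array
{
    $dom = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Assumption: "infobar" may be one of several class names on the div.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " infobar ")]'
           . '//a[starts-with(@href, "/link/")]';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}
```

Called with the `$content` string from the answer, this should return the same four hrefs. Links outside the div, or links that do not start with /link/, are left out by the query itself.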

甲如呢乙后呢 2024-11-23 16:50:00

Revising my previous answer. You'll need to do it in two steps:

//This first step grabs the contents of the div.
preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $x);

//And here, we grab all of the links.
preg_match_all('#href="/link/(.*?)"#is', $x[0], $x);
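
For reuse, the two steps could be wrapped in a small function that also handles the case where the div is missing. This is a rough sketch; the name `linksFromInfobar` and the separate `$block` and `$matches` variables are illustrative choices, not part of the answer.

```php
<?php
// Hypothetical wrapper around the two-step approach shown above.
function linksFromInfobar(string $raw): array
{
    // Step 1: isolate the contents of the first div.infobar.
    if (!preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $block)) {
        return []; // no div.infobar found
    }

    // Step 2: capture the part after /link/ in every href inside that block.
    preg_match_all('#href="/link/(.*?)"#is', $block[0], $matches);

    return $matches[1];
}
```

Note that step 1 stops at the first `</div>`, so a div nested inside the infobar would cut the block short. That is a limitation of the regex approach in general.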

暮凉 2024-11-23 16:50:00

You could use Simple HTML DOM (http://simplehtmldom.sourceforge.net/):

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

初懵 2024-11-23 16:50:00

Try this (I added a +):

preg_match_all('#<div class="infobar">.*?(href="/link/(?:.*?)")+ .*?</div>#is', $raw, $x);
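
A repeated capture group only keeps its last repetition, and the links in the sample are not directly adjacent, so a different single-pattern technique may be worth trying: the `\G` anchor, which lets each new match continue where the previous one ended. The sketch below is a swapped-in approach, not the pattern from this answer, and the function name `allLinksSinglePattern` is hypothetical. It also assumes the infobar contains no nested div.

```php
<?php
// Hypothetical helper: collect every /link/ href inside div.infobar
// with one preg_match_all call, using \G to chain the matches.
function allLinksSinglePattern(string $raw): array
{
    $pattern = '#(?:<div class="infobar">|\G(?!^))' // start at the div, or resume after the previous match
             . '(?:(?!</div>).)*?'                  // skip ahead without crossing the closing </div>
             . 'href="/link/(.*?)"#is';             // capture the part after /link/

    preg_match_all($pattern, $raw, $m);

    return $m[1];
}
```

Each match either begins at the opening tag or resumes right after the previous link, and the negative lookahead keeps the search from running past `</div>`, so links after the div are not picked up.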
