How do I extract HTML comments and all the HTML contained in a node?

Posted on 2024-11-08 03:57:41

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty xpath answer, but if I have to resort to regular expressions I guess that's ok. Although I'm not so great with regex so if you think that's the way to go, I'd appreciate examples...

Pretty standard starting point:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$html = curl_exec($ch);
if (!$html) {
    $info .= "<br />cURL error number:" . curl_errno($ch);
    $info .= "<br />cURL error:" . curl_error($ch);
    return $info;
}

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

and extraction of info, for example:

// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes (' . $iframes->length . '):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
    // get iframe attributes
    $iframe = $iframes->item($i);
    $framesrc = $iframe->getAttribute("src");
    $framewidth = $iframe->getAttribute("width");
    $frameheight = $iframe->getAttribute("height");
    $framealt = $iframe->getAttribute("alt");
    $frameclass = $iframe->getAttribute("class");
    $info .= $framesrc . ' (' . $framewidth . 'x' . $frameheight . '; class="' . $frameclass . '")' . '<br />';
}

Questions/Problems:

  1. How to extract HTML comments?

    I can't figure out how to identify the comments – are they considered nodes, or something else entirely?

  2. How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.

Comments (4)

分分钟 2024-11-15 03:57:41

Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:

$comments = $xpath->query('//comment()'); // or another path, as you prefer

They are standard nodes; see the PHP manual entry for the DOMComment class.
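
As a minimal sketch of the comment() test in use (the sample markup and variable names are made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator), something like this collects the text of every comment in a document:

```php
<?php
// Sample markup, invented for this example.
$html = '<html><body><!-- header start --><div id="main"><p>Hello</p><!-- inline note --></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$comments = array();
foreach ($xpath->query('//comment()') as $node) {
    // Each $node is a DOMComment; nodeValue is the text between <!-- and -->.
    $comments[] = trim($node->nodeValue);
}

print_r($comments);
```

To limit the search to one part of the page, a narrower path such as //div[@id="main"]//comment() should work the same way.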


To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:

// $el should be the element you want to get the HTML for
$html = $dom->saveXML($el);
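
Here is a rough sketch of getting a div's markup as a block. It assumes PHP 5.3.6 or later, where saveHTML() also accepts a node argument and serializes as HTML rather than XML (saveXML() would write the image as a self-closed tag). The sample markup and the id "box" are invented for illustration:

```php
<?php
// Sample markup, invented for this example.
$html = '<html><body><div id="box"><img src="a.png" alt="A"><a href="/one">One</a> <a href="/two">Two</a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="box"]')->item(0);

// Outer HTML: the div element itself plus everything inside it.
$outer = $dom->saveHTML($div);

// Inner HTML: serialize each child node in turn and join the pieces.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $outer, "\n", $inner, "\n";
```

The loop over childNodes is one way to drop the wrapping div tag when only its contents are wanted.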
与风相奔跑 2024-11-15 03:57:41

For HTML comments, a quick method is:

function getComments($html) {
    $matches = array();

    // With the default PREG_PATTERN_ORDER flag, $matches[0] holds the full
    // matches and $matches[1] holds the text captured by the first group.
    if (preg_match_all('#<!--(.*?)-->#s', $html, $matches)) {
        return $matches[1];
    }

    // No comments matched
    return null;
}
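
For comparison, here is a small sketch of the same regex idea using PREG_SET_ORDER, which groups the results per match so that a foreach over them reads naturally. The function name extractCommentBodies is hypothetical. Keep in mind that a regex sees raw text, so it will also match comment-like sequences inside scripts or attribute values, which the DOM approach would not report:

```php
<?php
// Hypothetical helper: returns the text inside each <!-- ... --> in $html.
function extractCommentBodies($html) {
    $matches = array();
    // PREG_SET_ORDER yields one entry per match:
    // $m[0] is the whole comment, $m[1] is the captured body.
    preg_match_all('#<!--(.*?)-->#s', $html, $matches, PREG_SET_ORDER);

    $bodies = array();
    foreach ($matches as $m) {
        $bodies[] = trim($m[1]);
    }
    return $bodies;   // empty array when nothing matched
}

$bodies = extractCommentBodies('<p>a</p><!-- first --><p>b</p><!--second-->');
print_r($bodies);
```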
怀念你的温柔 2024-11-15 03:57:41

This regex may help you:

\s*<!--[\s\S]+?-->

You can try it out in a regex tester.

千里故人稀 2024-11-15 03:57:41

For comments, you're looking for a recursive regex. For instance, to get rid of HTML comments (note that preg_replace() takes the replacement string as its second argument):

$yourHTML = preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);

To find them:

preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s', $yourHTML, $comments);