DOM and XPath queries for HTML parsing

Posted on 2024-12-17 06:38:11

I'm trying to write a robot that will fetch HTML and parse it daily.
For parsing HTML I could just use string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner, so now I can keep a configuration of all the sites I have to spider and the tags I have to strip out, like:

'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'

So the code looks like this

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about malformed markup
    $xpath = new DOMXPath($dom);
    $tags = $xpath->query('//body/div[@class="articleDesc"]');

    foreach ($tags as $tag) {
        echo $tag->nodeValue . "\n";
    }

So with this I get all the div tags with the class articleDesc, which is great. But I noticed that all the HTML tags inside the div tag are stripped out.
I wonder how I would get the whole contents of the div I'm looking at.

I also find it hard to find any proper documentation for $xpath->query() that explains how to form the query string. The PHP site doesn't say much about its exact format. Still, my main problem is …

Comments (4)

迟月 2024-12-24 06:38:11

The simple answer is:

    foreach ($tags as $tag) {
        echo $dom->saveXML($tag);
    }

If you want the a tags with their HTML intact, the XPath would be

    //a[@class="articleDesc"]

That assumes the a tags have that class attribute.
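To make the difference concrete, here is a minimal sketch comparing nodeValue with saveXML() on the same node. The sample markup is invented for illustration, and libxml_use_internal_errors() is used here in place of the @ operator; neither comes from the original post.

```php
<?php
// Sample markup is made up for illustration.
$html = '<html><body><div class="articleDesc"><p>Hello <b>world</b></p></div></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    // nodeValue concatenates the text nodes only, so the markup is lost.
    echo $tag->nodeValue, "\n";
    // saveXML($node) serializes the node itself, child tags included.
    echo $dom->saveXML($tag), "\n";
}
```

Note that saveXML() emits XML-style serialization; on PHP 5.3.6 and later, saveHTML() also accepts a node argument if HTML-style output is preferred.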

不再让梦枯萎 2024-12-24 06:38:11

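Try using http://www.php.net/manual/en/simplexmlelement.asxml.php

Or, as an alternative:

    // Returns the serialized children of $oNode (its "inner HTML").
    function getNodeInnerHTML(DOMNode $oNode) {
      $oDom = new DOMDocument();
      // The property is childNodes (plural); childNode does not exist.
      foreach ($oNode->childNodes as $oChild) {
        $oDom->appendChild($oDom->importNode($oChild, true));
      }
      return $oDom->saveHTML();
    }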
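A shorter variant of the same idea, as a sketch: it assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument, so no second document is needed. The name innerHtml() and the sample markup are hypothetical, chosen only for this example.

```php
<?php
// innerHtml() is a hypothetical helper name, not a built-in DOM method.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample markup is made up for illustration.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div class="articleDesc"><p>Hello <b>world</b></p></div></body></html>');
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@class="articleDesc"]') as $div) {
    // Prints the children of the div without the div wrapper itself.
    echo innerHtml($div), "\n";
}
```

Serializing each child in place avoids the importNode() copy, at the cost of requiring that the node still belongs to a document.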
又怨 2024-12-24 06:38:11

This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.

    // simplexml_load_string() expects well-formed XML and returns false otherwise.
    $xml = simplexml_load_string($html);
    $tags = $xml->xpath('//body/div[@class="articleDesc"]');
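A minimal sketch of the SimpleXML route, under the assumption that the input is well-formed XML (real-world HTML often is not, in which case simplexml_load_string() returns false). The sample markup is invented for illustration.

```php
<?php
// Assumes well-formed, XHTML-like input; the sample markup is made up.
$html = '<html><body><div class="articleDesc"><p>Hello <b>world</b></p></div></body></html>';

$xml = simplexml_load_string($html);
if ($xml === false) {
    exit("Input is not well-formed XML\n");
}

$tags = $xml->xpath('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    // asXML() with no argument returns the element and its children as a string.
    echo $tag->asXML(), "\n";

    // dom_import_simplexml() returns the corresponding DOM node.
    $node = dom_import_simplexml($tag);
    echo $node->nodeName, "\n";
}
```

For markup that is not well-formed, loading it with DOMDocument::loadHTML() and converting with simplexml_import_dom() may be a more forgiving starting point.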
┾廆蒐ゝ 2024-12-24 06:38:11

You could use Scrapy, an awesome spider framework (in Python).
