DOM and XPath queries for HTML parsing

Posted 2024-12-17 06:38:11

I'm trying to write a robot that fetches HTML and parses it daily.
For parsing the HTML I could use plain string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner. It lets me keep a configuration of all the sites I have to spider and the tags I have to extract, like:

'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'

So the code looks like this:

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ silences warnings from malformed markup
    $xpath = new DOMXPath($dom);
    $tags = $xpath->query('//body/div[@class="articleDesc"]');

    foreach ($tags as $tag) {
        echo $tag->nodeValue . "\n";
    }

With this I get all the div tags with the class articleDesc, which is great. But I noticed that all the HTML tags inside the div are stripped out.
How can I get the whole contents of the div I'm looking at?

I also find it hard to find any proper documentation for $xpath->query() that explains how to form the query string. The PHP site doesn't say much about its exact syntax. Still, my main problem is ...

迟月 2024-12-24 06:38:11

The simple answer is:

    foreach ($tags as $tag) {
        echo $dom->saveXML($tag);
    }
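
To make the difference concrete, here is a minimal sketch that prints both nodeValue and saveXML($tag) for the same node. The sample markup is invented for illustration, and libxml_use_internal_errors() is swapped in for the @ operator from the question; neither is part of the original code.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<html><body><div class="articleDesc">Hello <b>world</b></div></body></html>';

$dom = new DOMDocument();
// Assumption: collect libxml parse warnings internally instead of using @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    $text   = $tag->nodeValue;     // text content only, nested markup is dropped
    $markup = $dom->saveXML($tag); // the div itself plus everything inside it
    echo $text, "\n";
    echo $markup, "\n";
}
```

Note that saveXML($tag) includes the matched div's own opening and closing tags. DOMDocument::saveHTML() also accepts a node argument (PHP 5.3.6 and later) if HTML-style serialization is preferred.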

If you want the a tags with their HTML intact, the XPath would be

    //a[@class="articleDesc"]

That assumes the a tags have that class attribute.
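
Since the question also asks how to form the query string: DOMXPath evaluates XPath 1.0 expressions, so the XPath 1.0 syntax applies. The sketch below runs a few common forms against invented sample markup (the links and class names are assumptions for illustration).

```php
<?php
// Hypothetical sample markup for illustration.
$html = '<html><body><div>'
      . '<a class="articleDesc" href="/one">First <em>story</em></a>'
      . '<a class="articleDesc featured" href="/two">Second</a>'
      . '<a href="/about">About</a>'
      . '</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Exact attribute match: class must be exactly "articleDesc".
$exact = $xpath->query('//a[@class="articleDesc"]');

// Substring match: also catches class="articleDesc featured".
$loose = $xpath->query('//a[contains(@class, "articleDesc")]');

// Attribute nodes: select the href attributes themselves.
$hrefs = [];
foreach ($xpath->query('//div/a[@class="articleDesc"]/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo $exact->length, ' exact, ', $loose->length, " loose\n";
echo implode(', ', $hrefs), "\n";
echo $dom->saveXML($exact->item(0)), "\n";
```

The exact match is strict about the whole attribute value, which is worth keeping in mind when a site adds a second class to its elements.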

不再让梦枯萎 2024-12-24 06:38:11

Try using SimpleXMLElement::asXML(): http://www.php.net/manual/en/simplexmlelement.asxml.php

Or, alternatively:

    function getNodeInnerHTML(DOMNode $oNode) {
        $oDom = new DOMDocument();
        // Copy each child into a fresh document; the property is childNodes (plural)
        foreach ($oNode->childNodes as $oChild) {
            $oDom->appendChild($oDom->importNode($oChild, true));
        }
        return $oDom->saveHTML();
    }
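
A variation on the same idea, as a sketch only: instead of copying the children into a second document, serialize each child in place with saveHTML($child), which accepts a node argument since PHP 5.3.6. The helper name innerHtml and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the markup of a node's children only,
// without the node's own opening and closing tags.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<html><body><div class="articleDesc">Hello <b>world</b></div></body></html>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="articleDesc"]')->item(0);

$inner = innerHtml($div);
echo $inner, "\n";
```

This is useful when the wrapping div itself is not wanted in the output, only what it contains.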

又怨 2024-12-24 06:38:11

This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.

    $xml = simplexml_load_string($html);
    $tags = $xml->xpath('//body/div[@class="articleDesc"]');
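
One caveat: simplexml_load_string() expects well-formed XML and returns false when parsing fails, which real-world HTML can trigger. A possible workaround, sketched here under that assumption, is to parse with the lenient DOMDocument::loadHTML() first and then convert with simplexml_import_dom(). The sample markup is invented for illustration.

```php
<?php
// Hypothetical sample markup; the unclosed <br> is fine in HTML but not in XML.
$html = '<html><body><div class="articleDesc">Hello<br><b>world</b></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html); // lenient HTML parser
libxml_clear_errors();

// Convert the parsed DOM tree into a SimpleXMLElement.
$xml = simplexml_import_dom($dom);
$tags = $xml->xpath('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    // asXML() serializes the element with its nested tags intact.
    $markup = $tag->asXML();
    echo $markup, "\n";
}
```

The asXML() call here is the method linked in the previous answer, so the two suggestions combine naturally.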

┾廆蒐ゝ 2024-12-24 06:38:11

You could use Scrapy, a great spider framework (Python).
