How to selectively remove nodes from a DOM document subtree?
I am using DOMDocument to parse an HTML document and extract some data from it. The relevant DOM subtree has the following structure:
<div id="tab1">
    <div class="some class name"></div>
    <div class="some other class name">arbitrary data and nodes</div>
    <p> lot of paragraphs to follow </p>
    <p> paragraphs </p>
    <p> paragraphs </p>
    <p> paragraphs </p>
    <p> paragraphs </p>
    <br />
    <br />
    <br />
    <br />
    <br />
    <table />
    <table />
    <table />
    <table />
</div>
I do not want the first two children of tab1. I am using the following PHP code:
<?php
// Each line of sitemap.txt names one scraped page.
$urlArray = file('sitemap.txt');
$dataSet = array();
foreach ($urlArray as $url) {
    $scrapedData = file_get_contents('./scraped-site/' . trim($url));
    $doc = new DOMDocument();
    // Suppress the warnings that malformed HTML would otherwise raise.
    @$doc->loadHTML($scrapedData);
    $domXpathDoc = new DOMXPath($doc);
    $results = '';
    // The id matches the markup shown above ("tab1").
    $xpathArray = array(
        'info' => '//*[@id="tab1"]',
    );
    $set = array();
    foreach ($xpathArray as $field => $xpath) {
        $domNodeList = $domXpathDoc->query($xpath);
        foreach ($domNodeList as $node) {
            // Serialize every child of the matched node; nothing is filtered here.
            foreach ($node->childNodes as $child) {
                $set[] = $child->ownerDocument->saveXML($child);
            }
        }
    }
    $dataSet[] = $set;
}
The code above gives me all of the children. How can I selectively skip certain nodes?
EDIT2: I tried the answer below (and learned something :) ). This works for me:
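The snippet that originally followed is not preserved here. As a minimal sketch of the kind of query being described, assuming the markup from the question and an XPath predicate of the form `not(self::div)` (the sample HTML string and the variable names are illustrative, not the author's original code):

```php
<?php
// Sample markup modeled on the question; the contents are placeholders.
$html = '<div id="tab1">'
      . '<div class="some class name"></div>'
      . '<div class="some other class name">arbitrary data and nodes</div>'
      . '<p>first paragraph</p><p>second paragraph</p>'
      . '<br /><table></table>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every element child of #tab1 whose name is not "div".
$nodes = $xpath->query('//*[@id="tab1"]/*[not(self::div)]');

$set = array();
foreach ($nodes as $node) {
    $set[] = $doc->saveXML($node);
}
print_r($set);
```

With this sample markup, the query should keep the `p`, `br`, and `table` children and skip both `div` elements. Because `/*` matches elements only, whitespace text nodes between the tags are not returned either.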
Basically, it tells XPath to ignore all elements named 'div'. You can ignore more than one element name like this:
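One way this might look, as a sketch rather than the author's original snippet: the names inside `not(...)` can be combined with `or`. The choice of `br` as the second excluded name, the sample markup, and the variable names are assumptions made for illustration.

```php
<?php
// Sample markup modeled on the question; the contents are placeholders.
$html = '<div id="tab1">'
      . '<div class="some class name"></div>'
      . '<div class="some other class name">arbitrary data and nodes</div>'
      . '<p>first paragraph</p><p>second paragraph</p>'
      . '<br /><table></table>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exclude two element names at once: neither "div" nor "br" children are selected.
$nodes = $xpath->query('//*[@id="tab1"]/*[not(self::div or self::br)]');

$names = array();
foreach ($nodes as $node) {
    $names[] = $node->nodeName;
}
print_r($names);
```

Further names can be appended to the same predicate with additional `or self::name` terms.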
Showing only the elements after the first two would work like this:
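A possible form of that query, as a hedged sketch and not the author's original snippet: a `position() > 2` predicate keeps the element children from the third one onward. The sample markup and variable names are assumptions for illustration.

```php
<?php
// Sample markup modeled on the question; the contents are placeholders.
$html = '<div id="tab1">'
      . '<div class="some class name"></div>'
      . '<div class="some other class name">arbitrary data and nodes</div>'
      . '<p>first paragraph</p><p>second paragraph</p>'
      . '<br /><table></table>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// position() counts only element children here, because "*" matches elements,
// so text nodes between the tags do not shift the numbering.
$nodes = $xpath->query('//*[@id="tab1"]/*[position() > 2]');

$names = array();
foreach ($nodes as $node) {
    $names[] = $node->nodeName;
}
print_r($names);
```

This variant skips the first two children regardless of their element names, which matches the question more directly than filtering by name when other `div` elements further down should be kept.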