Accessing child divs using DOMDocument and XPath
I'm building a basic screen scraper for personal use and learning purposes, so please do not post comments like "You need to ask permission" etc.
The data I'm trying to access is structured as follows:
<tr>
  <td>
    <div class="wrapper">
      <div class="randomDiv">
        <div class="divContent">
          <div class="event">asd</div>
          <div class="date">asd</div>
          <div class="venue">asd</div>
          <div class="state">asd</div>
        </div>
      </div>
    </div>
  </td>
</tr>
I'm attempting to gather all this data (as there are about 20 rows on the given page).
Using the following code I have managed to gather the data I need:
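$remote = file_get_contents("linktoURL");
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
// Collect libxml parse warnings internally instead of silencing them with @
libxml_use_internal_errors(true);
$doc->loadHTML($remote);
libxml_clear_errors();
$xp = new DOMXPath($doc);
// Will hold one associative array per row
$rows = array();
// Print the text content of every element whose class contains "wrapper"
foreach ($xp->query("//*[contains(@class, 'wrapper')]") as $found) {
    echo "<pre>";
    print_r($found->nodeValue);
    echo "</pre>";
}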
Now my question is, how would I go about storing all this data into an associative array like below:
Array
(
    [0] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )
    [1] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )
    [2] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )
)
Right now, the only solution that comes to mind is to run a separate XPath query for each class name, e.g. //*[contains(@class, 'className')], inside the foreach loop.
Is there a more idiomatic way via DOMDocument and XPath wherein I am able to create an associative array of the above data?
Edit:
I'm not limited to using DOMDocument and XPath. If there are other solutions that might be easier, please post them.
Comments (1)
You can import some functionality into DOMXPath by registering PHP functions, but AFAIK you're limited to returning scalars or node sets.
You could transform it with a simple stylesheet, using XSLTProcessor::transformToDoc(), possibly exporting it to SimpleXML for easier access. The question is whether that is any faster than searching for every class manually.
You can of course shorten your XPath usage by using
//div[contains(@class, 'event') or contains(@class, 'date')]
etc.
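For what it's worth, here is a minimal sketch of one way to build the array without a separate query per class name. It runs one query per wrapper, relative to that wrapper, and uses each child div's class attribute as the array key. The sample markup and the names $html, $events and $row are made up for illustration, and it assumes every leaf div carries exactly one class name, as in the question's snippet.

```php
<?php
// Hypothetical sample markup mirroring the structure shown in the question.
$html = <<<HTML
<table>
<tr><td><div class="wrapper"><div class="randomDiv"><div class="divContent">
  <div class="event">Concert A</div>
  <div class="date">12/12/12</div>
  <div class="venue">Town Hall</div>
  <div class="state">NY</div>
</div></div></div></td></tr>
<tr><td><div class="wrapper"><div class="randomDiv"><div class="divContent">
  <div class="event">Concert B</div>
  <div class="date">01/01/13</div>
  <div class="venue">City Arena</div>
  <div class="state">CA</div>
</div></div></div></td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$events = [];

// One iteration per wrapper div.
foreach ($xp->query("//div[contains(@class, 'wrapper')]") as $wrapper) {
    $row = [];
    // The leading "." makes the query relative to the current wrapper;
    // it selects the div children of that wrapper's divContent div.
    foreach ($xp->query(".//div[contains(@class, 'divContent')]/div", $wrapper) as $field) {
        // Assumption: the class attribute holds a single name (event, date, ...).
        $row[$field->getAttribute('class')] = trim($field->textContent);
    }
    $events[] = $row;
}

print_r($events);
```

The keys come out lowercase (event, date, venue, state) because they are taken straight from the class attributes. If the real page puts several class names on one div, you would need to pick the relevant one before using it as a key.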