DOM and XPath queries for HTML parsing
I'm trying to write a robot that will fetch HTML and parse it daily.
Now, for parsing HTML I could just use string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner, so now I can make a configuration of all the sites I have to spider and the tags I have to strip out, like:
'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'
So the code looks like this:
$dom = new DOMDocument();
// @ suppresses the warnings that loadHTML() raises on malformed markup
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');
foreach ($tags as $tag) {
    echo $tag->nodeValue . "\n";
}
So with this I get all the div tags with the class articleDesc, which is great. But I noticed that all the HTML tags inside the div tag are stripped out.
I wonder how I would get the whole contents of the div I'm looking at.
I also find it hard to find any proper documentation for $xpath->query() that explains how to form the query string. The PHP site doesn't say much about its exact format. Still, my main problem i
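For reference, DOMXPath::query() takes an XPath 1.0 expression, so the string is formed from steps (`//tag`, `/tag`), predicates (`[@attr="value"]`), and attribute selectors (`/@attr`). The following is only a minimal sketch of the two expressions used above; the sample markup is hypothetical, and libxml_use_internal_errors() is swapped in for the @ operator as an assumption about how one might prefer to silence parser warnings.

```php
<?php
// Hypothetical sample markup, not taken from any real site.
$html = '<html><body>'
      . '<div class="articleDesc"><a class="articleDesc" href="/a1">First <b>post</b></a></div>'
      . '<div class="other"><a href="/a2">Second</a></div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// //tag            matches the element at any depth
// [@attr="value"]  keeps only nodes whose attribute equals the value
// /@attr           selects the attribute node itself
$divs  = $xpath->query('//body/div[@class="articleDesc"]');
$hrefs = $xpath->query('//div/a[@class="articleDesc"]/@href');

$divCount  = $divs->length;               // number of matching div elements
$firstHref = $hrefs->item(0)->nodeValue;  // value of the first matched href attribute

echo $divCount, "\n";
echo $firstHref, "\n";
```

Keeping the expressions as plain strings is what makes the per-site configuration array in the question workable: each site maps to one expression that is passed to query() unchanged.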
Comments (4)
The simple answer is:
If you want the a tags with their HTML unstripped, the XPath would be
That's assuming the a tags have that class attribute.
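Independently of which XPath is used, nodeValue returns only the text content of a node, which is why the inner tags disappear. One way to keep the markup is to pass the matched node to DOMDocument::saveHTML(). The following is a minimal sketch, assuming PHP 7 or later; the sample markup and the innerHtml() helper are hypothetical and only for illustration.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><div class="articleDesc">Read <a href="/a1">more <b>here</b></a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Hypothetical helper: concatenate the serialized children of a node,
// which gives the contents of the element without its own tag.
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$outer = '';
$inner = '';
foreach ($xpath->query('//body/div[@class="articleDesc"]') as $tag) {
    $outer = $dom->saveHTML($tag); // the div itself, including nested tags
    $inner = innerHtml($tag);      // only what is inside the div
    echo $outer, "\n", $inner, "\n";
}
```

saveHTML($tag) is the shorter option when the wrapping div is wanted as well; the helper is only needed when the contents alone should be kept.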
Try using http://www.php.net/manual/en/simplexmlelement.asxml.php
Or, alternatively:
This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.
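The SimpleXML route can be sketched roughly as follows. This is not the snippet from the answer above, which is missing from this page; it is an assumption about one way to combine simplexml_import_dom(), SimpleXMLElement::asXML(), and dom_import_simplexml(), using hypothetical sample markup.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><div class="articleDesc">Read <a href="/a1">more <b>here</b></a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// View the parsed document through SimpleXML.
$sxml = simplexml_import_dom($dom);

$markup = '';
$node   = null;
foreach ($sxml->xpath('//body/div[@class="articleDesc"]') as $div) {
    $markup = $div->asXML();            // string containing the div and its inner tags
    $node   = dom_import_simplexml($div); // the same element as a DOMElement
    echo $markup, "\n";
}
```

Note that asXML() serializes as XML rather than HTML, so empty elements may be written in self-closing form.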
You could use this awesome spider framework (in Python): Scrapy.