How can I use DOM and XPath to grab links from a page?
I have a page scraped with curl and want to grab all of the links that have a certain id. As far as I can tell, the best way to do this is with DOM and XPath. The code below grabs a large number of the URLs, but it cuts many of them off and also grabs text that is not a URL.
$curl_scraped_page is the page scraped with curl.
$dom = new DOMDocument();
@$dom->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Am I on the right track? Do I just need to adjust the "/html/body//a" XPath expression, or do I need to add more to capture the id attribute?
Comments (4)
You can also do it this way, and you'll get only a tags that have both an id and an href:
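A minimal sketch of what such a query might look like, assuming the standard DOMDocument and DOMXPath classes already used in the question. The helper name extract_links_with_id and the sample markup are made up for illustration and are not from the original post. The XPath predicate [@id and @href] keeps only anchors that carry both attributes, and reading the href attribute directly (rather than the node's text) avoids picking up link text that is not a URL.

```php
<?php
// Hypothetical helper (not from the original post): returns one
// ['id' => ..., 'href' => ...] entry per <a> that has both attributes.
function extract_links_with_id(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // The predicate keeps only anchors with both an id and an href.
    foreach ($xpath->query('//a[@id and @href]') as $a) {
        // Read the attributes themselves; nodeValue would be the link text.
        $links[] = [
            'id'   => $a->getAttribute('id'),
            'href' => $a->getAttribute('href'),
        ];
    }
    return $links;
}

// Made-up sample markup standing in for $curl_scraped_page.
$sample = '<html><body>'
    . '<a id="first" href="http://example.com/one">One</a>'
    . '<a href="http://example.com/no-id">No id</a>'
    . '<a id="anchor-only">No href</a>'
    . '<a id="second" href="/two?x=1&amp;y=2">Two</a>'
    . '</body></html>';

$links = extract_links_with_id($sample);
print_r($links);
```

To match one particular id instead of any id, the predicate could be narrowed to something like //a[@id="some-id"], where some-id is a placeholder value.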
This is the solution regarding your question.
http://simplehtmldom.sourceforge.net/
I think the easiest way is to combine the following two classes to pull information from another website:
Pull info from any HTML tag, contents or tag attribute: http://simplehtmldom.sourceforge.net/
Easy to handle curl, supports POST requests: https://github.com/php-curl-class/php-curl-class
Example:
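A rough sketch of how the two libraries might be combined, not a definitive implementation. It assumes php-curl-class is installed through Composer and that the 1.x single-file simple_html_dom.php sits next to the script; both paths and the URL are placeholders. The a[id] selector asks Simple HTML DOM for anchors that have an id attribute.

```php
<?php
// Assumption: both libraries are installed; these paths are placeholders.
require __DIR__ . '/vendor/autoload.php';   // php-curl-class via Composer
require __DIR__ . '/simple_html_dom.php';   // Simple HTML DOM 1.x include

use Curl\Curl;

$curl = new Curl();
$curl->get('http://example.com/');          // placeholder URL

if ($curl->error) {
    echo 'Error: ' . $curl->errorCode . ': ' . $curl->errorMessage . "\n";
} else {
    // rawResponse is the undecoded response body as a string.
    $html = str_get_html($curl->rawResponse);
    if ($html !== false) {
        // Print the id and href of every anchor that has an id attribute.
        foreach ($html->find('a[id]') as $a) {
            echo $a->id . ' => ' . $a->href . "\n";
        }
        $html->clear();                     // release the parsed document
    }
}
$curl->close();
```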
See the Simple HTML DOM Parser manual at the link above for how to manipulate the HTML data.