How to use PhantomJS to spider a domain
I am trying to leverage PhantomJS to spider an entire domain. I want to start at the root domain, e.g. www.domain.com, pull all links (a.href), and then have a queue that fetches each new link and adds new links to the queue if they haven't already been crawled and aren't already queued.

Ideas? Help?

Thanks in advance!
3 Answers
You might be interested in checking out Pjscrape (disclaimer: this is my project), an Open Source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each as it progresses. You could spider an entire site, looking at every anchor link, with a short script like this:
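A minimal suite along those lines might look something like the following (the start URL is a placeholder, and the pjs.addSuite, moreUrls and scraper options follow the Pjscrape documentation, so treat the exact names as a sketch rather than a verbatim script):

    // pjscrape config file - run with: phantomjs pjscrape.js this_file.js
    pjs.addSuite({
        // page to start spidering from (placeholder URL)
        url: 'http://www.example.com/',
        // return further URLs to spider from each page;
        // _pjs.getAnchorUrls('a') collects the href of every anchor
        moreUrls: function() {
            return _pjs.getAnchorUrls('a');
        },
        // scrape something from every page visited - here, its <title> text
        scraper: function() {
            return [$('title').text()];
        }
    });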
By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.
This is an old question, but as an update: an awesome modern answer is http://www.nightmarejs.org/ (GitHub: https://github.com/segmentio/nightmare).
Quoting a compelling example from their homepage:
RAW PHANTOMJS:
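(Paraphrased rather than quoted verbatim; this sketch assumes the node phantom bridge module, and the Yahoo search flow and selectors are purely illustrative.)

    // Deeply nested callbacks with the node 'phantom' bridge:
    // open a page, type a search query, then click the submit button.
    var phantom = require('phantom');

    phantom.create(function (ph) {
      ph.createPage(function (page) {
        page.open('http://yahoo.com', function (status) {
          // fill in the search box inside the page context
          page.evaluate(function () {
            var el = document.querySelector('input[title="Search"]');
            el.value = 'github nightmare';
          }, function (result) {
            // then dispatch a click on the submit button
            page.evaluate(function () {
              var el = document.querySelector('.searchsubmit');
              var event = document.createEvent('MouseEvent');
              event.initEvent('click', true, false);
              el.dispatchEvent(event);
            }, function (result) {
              ph.exit();
            });
          });
        });
      });
    });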
WITH NIGHTMARE:
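(Again paraphrased; the chainable goto/type/click/run calls reflect the Nightmare v1 API of that era, while newer releases use a promise-based end()/then() style instead.)

    // The same flow expressed as a single Nightmare chain.
    var Nightmare = require('nightmare');

    new Nightmare()
      .goto('http://yahoo.com')
      .type('input[title="Search"]', 'github nightmare')
      .click('.searchsubmit')
      .run(function (err, nightmare) {
        if (err) return console.log(err);
        console.log('Done.');
      });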
First, select all anchors on the index page and make a list of their href values. You can do this either with PhantomJS's DOM selectors or with jQuery selectors. Then do the same thing for each of those pages, until a page no longer contains any new links. Keep a master list of all links, plus a list of links per page, so you can determine whether a link has already been processed. You can think of web crawling as a tree: the root node is the index page, the child nodes are the pages linked from the index page, and each child node can have one or more children of its own, depending on the links its page contains. I hope this helps.
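A rough sketch of that approach as a standalone PhantomJS script, assuming a placeholder start URL and a simple substring check to stay on the current domain (error handling and crawl delays are omitted):

    // Breadth-first crawl: a queue of URLs to visit and a set of URLs
    // already seen; each page contributes any unseen same-domain links.
    var webpage = require('webpage');

    var startUrl = 'http://www.example.com/';   // placeholder root page
    var domain   = 'www.example.com';           // only follow links on this host
    var queue    = [startUrl];
    var seen     = {};
    seen[startUrl] = true;

    function crawlNext() {
        if (queue.length === 0) {
            phantom.exit();
            return;
        }
        var url  = queue.shift();
        var page = webpage.create();
        page.open(url, function (status) {
            if (status === 'success') {
                // collect every anchor href from the rendered page
                var links = page.evaluate(function () {
                    return Array.prototype.map.call(
                        document.querySelectorAll('a[href]'),
                        function (a) { return a.href; }
                    );
                });
                links.forEach(function (link) {
                    // enqueue only links we haven't seen that stay on the domain
                    if (!seen[link] && link.indexOf(domain) !== -1) {
                        seen[link] = true;
                        queue.push(link);
                    }
                });
                console.log(url + ' -> ' + links.length + ' links');
            }
            page.close();
            crawlNext();
        });
    }

    crawlNext();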