How to use PhantomJS to spider a domain
I am trying to leverage PhantomJS to spider an entire domain. I want to start at the root domain, e.g. www.domain.com, pull all links (a.href), and then have a queue that fetches each new link and adds new links to the queue if they haven't already been crawled and aren't already queued.

Ideas? Help?

Thanks in advance!
3 Answers
You might be interested in checking out Pjscrape (disclaimer: this is my project), an Open Source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each as it progresses. You could spider an entire site, looking at every anchor link, with a short script like this:
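A minimal suite along those lines might look something like the following (the start URL is a placeholder, and the pjs.addSuite, moreUrls and scraper options follow the Pjscrape documentation, so treat the exact names as a sketch rather than a verbatim script):

    // pjscrape config file - run with: phantomjs pjscrape.js this_file.js
    pjs.addSuite({
        // page to start spidering from (placeholder URL)
        url: 'http://www.example.com/',
        // return further URLs to spider from each page;
        // _pjs.getAnchorUrls('a') collects the href of every anchor
        moreUrls: function() {
            return _pjs.getAnchorUrls('a');
        },
        // scrape something from every page visited - here, its <title> text
        scraper: function() {
            return [$('title').text()];
        }
    });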
By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.
This is an old question, but as an update: an awesome modern answer is http://www.nightmarejs.org/ (GitHub: https://github.com/segmentio/nightmare).
Quoting a compelling example from their homepage:
RAW PHANTOMJS:
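(Paraphrased rather than quoted verbatim; this sketch assumes the node phantom bridge module, and the Yahoo search flow and selectors are purely illustrative.)

    // Deeply nested callbacks with the node 'phantom' bridge:
    // open a page, type a search query, then click the submit button.
    var phantom = require('phantom');

    phantom.create(function (ph) {
      ph.createPage(function (page) {
        page.open('http://yahoo.com', function (status) {
          // fill in the search box inside the page context
          page.evaluate(function () {
            var el = document.querySelector('input[title="Search"]');
            el.value = 'github nightmare';
          }, function (result) {
            // then dispatch a click on the submit button
            page.evaluate(function () {
              var el = document.querySelector('.searchsubmit');
              var event = document.createEvent('MouseEvent');
              event.initEvent('click', true, false);
              el.dispatchEvent(event);
            }, function (result) {
              ph.exit();
            });
          });
        });
      });
    });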
WITH NIGHTMARE:
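(Again paraphrased; the chainable goto/type/click/run calls reflect the Nightmare v1 API of that era, while newer releases use a promise-based end()/then() style instead.)

    // The same flow expressed as a single Nightmare chain.
    var Nightmare = require('nightmare');

    new Nightmare()
      .goto('http://yahoo.com')
      .type('input[title="Search"]', 'github nightmare')
      .click('.searchsubmit')
      .run(function (err, nightmare) {
        if (err) return console.log(err);
        console.log('Done.');
      });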
First, select all anchors on the index page and make a list of their href values. You can do this either with PhantomJS's DOM selectors or with jQuery selectors. Then do the same thing for each of those pages, until a page no longer contains any new links. Keep a master list of all links, plus a list of links per page, so you can determine whether a link has already been processed. You can think of web crawling as a tree: the root node is the index page, the child nodes are the pages linked from the index page, and each child node can have one or more children of its own, depending on the links its page contains. I hope this helps.
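A rough sketch of that approach as a standalone PhantomJS script, assuming a placeholder start URL and a simple substring check to stay on the current domain (error handling and crawl delays are omitted):

    // Breadth-first crawl: a queue of URLs to visit and a set of URLs
    // already seen; each page contributes any unseen same-domain links.
    var webpage = require('webpage');

    var startUrl = 'http://www.example.com/';   // placeholder root page
    var domain   = 'www.example.com';           // only follow links on this host
    var queue    = [startUrl];
    var seen     = {};
    seen[startUrl] = true;

    function crawlNext() {
        if (queue.length === 0) {
            phantom.exit();
            return;
        }
        var url  = queue.shift();
        var page = webpage.create();
        page.open(url, function (status) {
            if (status === 'success') {
                // collect every anchor href from the rendered page
                var links = page.evaluate(function () {
                    return Array.prototype.map.call(
                        document.querySelectorAll('a[href]'),
                        function (a) { return a.href; }
                    );
                });
                links.forEach(function (link) {
                    // enqueue only links we haven't seen that stay on the domain
                    if (!seen[link] && link.indexOf(domain) !== -1) {
                        seen[link] = true;
                        queue.push(link);
                    }
                });
                console.log(url + ' -> ' + links.length + ' links');
            }
            page.close();
            crawlNext();
        });
    }

    crawlNext();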