How to use PhantomJS to spider a site

Posted on 2024-12-16 00:19:32

I am trying to leverage PhantomJS to spider an entire domain. I want to start at the root domain, e.g. www.domain.com, pull all links (a.href), and then maintain a queue that fetches each new link, adding links to the queue if they haven't already been crawled or queued.

Ideas, Help?

Thanks in advance!
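The queue bookkeeping described in the question can be sketched in plain JavaScript. This is a minimal sketch, not a working crawler: `fetchLinks` is a placeholder we invented — in a real PhantomJS script it would open the page with `page.open()` and collect `a.href` values via `page.evaluate()`, but here it is injected so the queue and dedup logic stand on their own.

```javascript
// Sketch of a breadth-first crawl queue with dedup.
// `fetchLinks(url)` is a stand-in for the PhantomJS page load +
// link extraction; it just returns the URLs linked from `url`.
function crawl(startUrl, fetchLinks) {
  var queue = [startUrl];
  var seen = {};           // URLs already crawled or already queued
  seen[startUrl] = true;
  var order = [];          // pages in the order they were processed

  while (queue.length > 0) {
    var url = queue.shift();         // FIFO -> breadth-first
    order.push(url);
    var links = fetchLinks(url);
    for (var i = 0; i < links.length; i++) {
      if (!seen[links[i]]) {         // skip crawled/queued URLs
        seen[links[i]] = true;
        queue.push(links[i]);
      }
    }
  }
  return order;
}
```

Note that PhantomJS page loads are asynchronous, so in practice the `while` loop would need to become a callback or recursive step that dequeues the next URL after each `page.open()` completes.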

Comments (3)

深空失忆 2024-12-23 00:19:32

You might be interested in checking out Pjscrape (disclaimer: this is my project), an Open Source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each as it progresses. You could spider an entire site, looking at every anchor link, with a short script like this:

pjs.addSuite({
    url: 'http://www.example.com/your_start_page.html',
    moreUrls: function() {
        // get all URLs from anchor links,
        // restricted to the current domain by default
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // scrapers can use jQuery
        return $('h1').first().text();
    }
});

By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.
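The "only follow links on the current domain" behavior mentioned above can be approximated with a small filter. This is our own illustrative helper, not part of the Pjscrape API, and the hostname extraction is deliberately crude.

```javascript
// Sketch of the kind of same-domain filter such a setting implies
// (the helper name is ours, not a Pjscrape function).
function sameDomainOnly(baseUrl, urls) {
  // Crude hostname extraction; a real script might use a URL parser.
  function host(u) {
    var m = u.match(/^https?:\/\/([^\/]+)/);
    return m ? m[1] : null;
  }
  var base = host(baseUrl);
  return urls.filter(function (u) {
    return host(u) === base;
  });
}
```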

谁许谁一生繁华 2024-12-23 00:19:32

This is an old question, but to update, an awesome modern answer is http://www.nightmarejs.org/ ( github: https://github.com/segmentio/nightmare )

Quoting a compelling example from their homepage:

RAW PHANTOMJS:

phantom.create(function (ph) {
  ph.createPage(function (page) {
    page.open('http://yahoo.com', function (status) {
      page.evaluate(function () {
        var el =
          document.querySelector('input[title="Search"]');
        el.value = 'github nightmare';
      }, function (result) {
        page.evaluate(function () {
          var el = document.querySelector('.searchsubmit');
          var event = document.createEvent('MouseEvent');
          event.initEvent('click', true, false);
          el.dispatchEvent(event);
        }, function (result) {
          ph.exit();
        });
      });
    });
  });
});

WITH NIGHTMARE:

new Nightmare()
  .goto('http://yahoo.com')
  .type('input[title="Search"]', 'github nightmare')
  .click('.searchsubmit')
  .run();

朱染 2024-12-23 00:19:32

First, select all anchors on the index page and make a list of the href values. You can do this either with PhantomJS's document selectors or with jQuery selectors. Then do the same thing for each page until a page no longer contains any new links. Keep a master list of all links, plus a list of links for each page, so you can determine whether a link has already been processed. You can think of web crawling as a tree: the root node is the index page, and the child nodes are the pages linked from it. Each child node can have one or more children, depending on the links its page contains. I hope this helps.
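The tree walk described above can be sketched as a recursive traversal with a master list. As with any sketch, `getLinks` is a placeholder we invented for "select all anchors on the page" (via PhantomJS document selectors or jQuery); the master list is what prunes branches for already-processed pages, which also keeps cycles between pages from recursing forever.

```javascript
// Sketch of the tree walk: each call processes one node (page) and
// recurses into its children (the links found on that page).
function spider(url, getLinks, master) {
  master = master || {};            // master list of processed links
  if (master[url]) return master;   // already processed: prune branch
  master[url] = true;
  var children = getLinks(url);     // links found on this page
  for (var i = 0; i < children.length; i++) {
    spider(children[i], getLinks, master);
  }
  return master;
}
```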
