Need help with screen scraping using anemone and nokogiri
I have a starting page of http://www.example.com/startpage which has 1220 listings, broken up by pagination in the standard way, e.g. 20 results per page.
I have code working that parses the first page of results and follows links that contain "example_guide/paris_shops" in their URL. I then use Nokogiri to pull specific data from that final page. All works well and the 20 results are written to a file.
However, I can't seem to figure out how to also get Anemone to crawl to the next page of results (http://www.example.com/startpage?page=2), then continue to parse that page, then the 3rd page (http://www.example.com/startpage?page=3) and so on.
So I'd like to ask if anyone knows how I can get Anemone to start on a page, parse all the links on that page (and the next level down for the specific data), but then follow the pagination to the next page of results so Anemone can start parsing again, and so on. Given that the pagination links are different from the links in the results, Anemone of course doesn't follow them.
At the moment I am loading the URL for the first page of results, letting that finish, and then pasting in the next URL for the 2nd page of results, etc. Very manual and inefficient, especially for getting hundreds of pages.
Any help would be much appreciated.
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    # Re-fetch the matched page and parse it with Nokogiri
    doc = Nokogiri::HTML(open(page.url))

    name    = doc.at_css("#top h2").text                  unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text        unless doc.at_css("tr:nth-child(5) a").nil?

    # Append one tab-separated line per listing
    open('savedwebdata.txt', 'a') { |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    }
  end
end
Actually, Anemone has the Nokogiri doc built into it. If you call page.doc, that is a Nokogiri doc, so there's no need to have two Nokogiri docs.
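For example, the question's on_pages_like block could be trimmed to reuse the document Anemone has already fetched and parsed (same selectors as in the question), roughly:

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    doc = page.doc  # Nokogiri document Anemone parsed while crawling; no extra open-uri fetch

    name    = doc.at_css("#top h2").text                  unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text        unless doc.at_css("tr:nth-child(5) a").nil?

    open('savedwebdata.txt', 'a') { |f| f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}" }
  end
end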
Without having actual HTML or a real site to hit, it's hard to give exact examples. I've done what you're trying to do many times, and you really only need open-uri and nokogiri. There are a bunch of different ways to determine how to move from one page to another, but when you know how many elements are on a page and how many pages there are, I'd use a simple loop over 1220 / 20 = 61 pages. The gist of the routine looks like:
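(The code sample itself is missing from this copy; a minimal sketch along those lines, assuming the ?page= URLs from the question and a hypothetical a.shop-link selector for the per-shop links, might look like:)

require 'open-uri'
require 'nokogiri'
require 'uri'

BASE_URL = 'http://www.example.com/startpage'   # start page from the question
PER_PAGE = 20
TOTAL    = 1220
pages    = (TOTAL / PER_PAGE.to_f).ceil          # => 61 result pages

File.open('savedwebdata.txt', 'a') do |out|
  (1..pages).each do |page_num|
    # Fetch and parse one page of results
    listing = Nokogiri::HTML(open("#{BASE_URL}?page=#{page_num}"))

    # "a.shop-link" is a placeholder selector for the per-shop links
    listing.css('a.shop-link').each do |link|
      shop_url = URI.join(BASE_URL, link['href']).to_s
      shop     = Nokogiri::HTML(open(shop_url))

      name    = shop.at_css('#top h2')
      address = shop.at_css('.info tr:nth-child(3) td')
      website = shop.at_css('tr:nth-child(5) a')
      out.puts [name, address, website].map { |n| n && n.text }.join("\t")
    end

    sleep 3  # be polite, mirroring the :delay => 3 from the question
  end
end

The (1..pages) loop replaces the manual copy/paste of ?page= URLs; the extraction itself is the same Nokogiri code as in the question.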
You might want to look into using Mechanize to crawl the site. It's not a crawler per se, but instead is a toolkit making it easy to navigate a site, fill in forms and submit them, deal with authentication, sessions, etc. It uses Nokogiri internally and makes it easy to walk the document and extract things using regular Nokogiri syntax.
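A rough sketch of that approach for this kind of paginated listing (the "next" link text and the shop-link pattern are assumptions about the site) might be:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.example.com/startpage')

loop do
  # Mechanize pages expose Nokogiri, so the usual CSS selectors work
  page.links_with(:href => /example_guide\/paris_shops/).each do |link|
    shop = link.click
    name = shop.at('#top h2')
    puts name.text if name
  end

  # Follow the pagination; "next" is an assumed link text
  next_link = page.link_with(:text => /next/i)
  break unless next_link
  page = next_link.click
end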