How do I scrape a site with multiple pages using Ruby and create one HTML page?
So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.
I am thinking of using Hpricot, but am not too sure how to proceed.
How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?
You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.
Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html and manually programme the script to extract text between certain tags (e.g. h3, p, etc.)?
If I do that approach, then I will have to look at each individual source for each chapter/article and then do that. Kinda defeats the purpose of writing a script to do it, no?
Ideally I would like a script that will be able to tell the difference between JS and other code and just the 'text' and dump it (formatted with the proper headings and such).
Would really appreciate some guidance.
Thanks.
1 Answer
I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.
I did some extensive scraping for work at one point and had to switch to Nokogiri, because Hpricot would inexplicably crash on some pages.
Check this RailsCast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
and:
http://nokogiri.org/
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/