Scraping with Mechanize using two searches?
I am scraping a blog using Mechanize and trying to get the results below. I'm mainly having trouble turning my thoughts into code logic. I assume I need to combine the search clauses, iterate through the HTML, and print matches as they are found. I'm new to Rails, and any advice would be helpful.
Desired results:
- first_title
- first_image_url
- second_image_url
- second_title
- first_image_url
- second_image_url
Code:
require 'rubygems'
require 'mechanize'
url = 'http://blog.something.com/'
mech = Mechanize.new
page = mech.get(url)
page.search('h2').each do |h2|
  puts h2.inner_text
end
imgs = page.search('img[src]').map { |img| img['src'] }
puts imgs
The code right now of course produces:
- first_title
- second_title
- third_title
- ...
- first_image_url
- second_image_url
- first_image_url
- ...
Comments (1)
Assuming the images are descendants of the h2, you could do: