Mechanize HTML scraping problem
I am trying to extract the email addresses from my website using Ruby Mechanize and Hpricot.
What I am trying to do is loop over all the pages of my administration side and parse each page with Hpricot. So far so good. Then I get:
Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*
After it has parsed a bunch of pages, it starts with a timeout and then prints the HTML code of the page.
I can't understand why. How can I debug that?
It seems like Mechanize can't get more than 10 pages in a row. Is that possible?
Thanks
require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'

class Harvester
  def initialize(page)
    @page=page
    @agent = WWW::Mechanize.new{|a| a.log = Logger.new("logs.log") }
    @agent.keep_alive=false
    @agent.read_timeout=15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password => "pass")
    f.submit
  end

  def harvest(s)
    pageNumber=1
    #@agent.read_timeout =
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        #time=Time.now
        #search=@agent.get( "http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        puts "unknown #{e.to_s}"
        #puts "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception"+ e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: "+ex.response_code
      rescue Timeout::Error => e
        puts "timeout: "+e.to_s
      end
    end
  end

  def extract(page)
    #puts search.body
    search=@agent.get( "http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)
    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove
    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      #delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts email
      f=File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end

puts "starting extracting emails ... "
start =ARGV[0].to_i
h=Harvester.new(186)
h.login
h.harvest(start)
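One thing that may help with debugging: Ruby tries `rescue` clauses in the order they are written, and an exception raised inside a `rescue` body is not handled by the sibling clauses. In the loop above the bare `rescue => e` is listed first and retries `extract(pagenb)` from inside the handler, so a second failure would escape unhandled. The sketch below illustrates the ordering and a protected retry using only the standard library. `classify` and `fetch_with_retry` are hypothetical helpers, not part of the original script or of Mechanize.

```ruby
require 'net/http'
require 'timeout'

# Hypothetical helper: shows which clause handles a given error when the
# specific classes are listed before the generic StandardError fallback.
def classify(error)
  raise error
rescue Net::HTTPBadResponse => e
  "net exception: #{e.message}"
rescue Timeout::Error => e
  "timeout: #{e.message}"
rescue => e
  "unknown: #{e.message}"
end

# Hypothetical helper: retries the block up to `attempts` times. The retry
# re-enters the protected region, so a second failure is rescued as well.
def fetch_with_retry(attempts = 2)
  tries = 0
  begin
    tries += 1
    yield
  rescue Net::HTTPBadResponse, Timeout::Error => e
    retry if tries < attempts
    "gave up after #{tries} tries: #{e.class}"
  end
end
```

In the script, the body of `extract(pagenb)` would take the place of the block passed to `fetch_with_retry`.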
Comments (1)
Mechanize keeps the full content of every fetched page in its history, which may cause problems when browsing through many pages. To avoid this, try limiting the size of the history.
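A minimal sketch of capping the history, assuming the agent's `max_history=` setter. The value `1` is illustrative and not taken from the answer, and the class name assumes a recent Mechanize release (older releases, like the one in the question, use `WWW::Mechanize`).

```ruby
require 'rubygems'
require 'mechanize'

# Assumption: recent Mechanize, where the class is `Mechanize`
# (the question's version would use `WWW::Mechanize.new` instead).
agent = Mechanize.new do |a|
  # Keep only the most recent page so earlier page bodies can be freed.
  a.max_history = 1
end
```

In the question's script, the same setting would go inside the block already passed to the constructor in `initialize`.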