Mechanize HTML scraping problem

Posted on 2024-07-20 10:28:51

I am trying to extract the emails from my website using Ruby Mechanize and Hpricot.
What I am trying to do is loop over all the pages of my administration side and parse them with Hpricot. So far so good. Then I get:

Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*

After it has parsed a bunch of pages, it starts with a timeout and then prints the HTML code of the page.
I can't understand why. How can I debug that?
It seems like Mechanize can't get more than 10 pages in a row. Is that possible?
Thanks

require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'

class Harvester

  def initialize(page)
    @page = page
    @agent = WWW::Mechanize.new { |a| a.log = Logger.new("logs.log") }
    @agent.keep_alive = false
    @agent.read_timeout = 15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password => "pass")
    f.submit
  end

  def harvest(s)
    pageNumber = 1
    #@agent.read_timeout =
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        #time=Time.now
        #search=@agent.get( "http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        puts "unknown #{e.to_s}"
        #puts  "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception" + e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: " + ex.response_code
      rescue Timeout::Error => e
        puts "timeout: " + e.to_s
      end
    end
  end

  def extract(page)
    #puts search.body
    search = @agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)

    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove

    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      #delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts email
      f = File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end

end

puts "starting extracting emails ... "

start = ARGV[0].to_i

h = Harvester.new(186)
h.login
h.harvest(start)


秋意浓 2024-07-27 10:28:51

Mechanize keeps the full content of every page it visits in its history, which may cause problems when browsing through many pages. To limit the size of the history, try:

@mech = WWW::Mechanize.new do |agent|
  agent.history.max_size = 1
end
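
Separately from the history size, the rescue clauses in the question's harvest loop may be hiding what actually fails. Ruby tries rescue clauses from top to bottom, and Net::HTTPBadResponse is a StandardError, so the bare `rescue => e` at the top catches it before the dedicated clause is reached. The second `extract(pagenb)` call inside that clause then runs with no protection at all. Below is a minimal sketch of a bounded retry with the specific clauses first, using only the standard library. The names `with_retries` and `MAX_ATTEMPTS` are hypothetical and not part of Mechanize, and the retry limit of 3 is an arbitrary assumption.

```ruby
require 'net/http'
require 'timeout'

MAX_ATTEMPTS = 3 # hypothetical retry limit, chosen arbitrarily

# Runs the given block, retrying on network-level errors up to
# MAX_ATTEMPTS times. Returns the block's value, or nil if every
# attempt failed or an unexpected error was raised.
def with_retries(label)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Net::HTTPBadResponse, Timeout::Error => e
    # Specific clauses come first: a bare `rescue => e` placed above
    # this one would catch Net::HTTPBadResponse before it got here.
    puts "#{label}: #{e.class} on attempt #{attempts}: #{e.message}"
    retry if attempts < MAX_ATTEMPTS
    nil
  rescue StandardError => e
    # Anything unexpected is reported once and not retried.
    puts "#{label}: unexpected #{e.class}: #{e.message}"
    nil
  end
end
```

In the question's loop this could wrap the call as `with_retries("page #{pagenb}") { extract(pagenb) }`, so a page that keeps failing is logged and skipped instead of aborting the run.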


About the author

埋葬我深情
