EventMachine：EM 可以处理的最大并行 HTTP 请求是多少？

发布于 2024-10-22 12:15:18 字数 946 浏览 6 评论 0原文

我正在构建一个分布式网络爬虫，并试图最大限度地利用每台机器的资源。我通过 Iterator 在 EventMachine 中运行解析函数，并使用 em-http-request 发出异步 HTTP 请求。目前我有 100 个同时运行的迭代，看来我无法通过这个级别。如果我增加迭代次数，它不会影响爬行速度。然而，我只获得 10-15% 的 CPU 负载和 20-30% 的网络负载，因此有足够的空间来加快爬行速度。

我正在使用 Ruby 1.9.2。有什么方法可以改进代码以有效地使用资源，或者我什至做错了？

def start_job_crawl     
  @redis.lpop @queue do |link|
    if link.nil?                
      EventMachine::add_timer( 1 ){ start_job_crawl() }
    else   
      #parsing link, using asynchronous http request, 
      #doing something with the content                         
      parse(link)
    end          
  end        
end

#main reactor loop   

EM.run {   
 EM.kqueue   

 @redis = EM::Protocols::Redis.connect(:host => "127.0.0.1")
 @redis.errback do |code|
  puts "Redis error: #{code}"
 end

 #100 parallel 'threads'. Want to increase this     

  EM::Iterator.new(0..99, 100).each do |num, iter| 
      start_job_crawl()    
  end
}

原文

I'm building a distributed web-crawler and trying to get maximum out of resources of each single machine. I run parsing functions in EventMachine through Iterator and use em-http-request to make asynchronous HTTP requests. For now I have 100 iterations that run at the same time and it seems that I can't pass over this level. If I increase a number of iteration it doesn't affect the speed of crawling. However, I get only 10-15% cpu load and 20-30% of network load, so there's plenty of room to crawl faster.

I'm using Ruby 1.9.2. Is there any way to improve the code to use resources effectively or maybe I'm even doing it wrong?

def start_job_crawl     
  @redis.lpop @queue do |link|
    if link.nil?                
      EventMachine::add_timer( 1 ){ start_job_crawl() }
    else   
      #parsing link, using asynchronous http request, 
      #doing something with the content                         
      parse(link)
    end          
  end        
end

#main reactor loop   

EM.run {   
 EM.kqueue   

 @redis = EM::Protocols::Redis.connect(:host => "127.0.0.1")
 @redis.errback do |code|
  puts "Redis error: #{code}"
 end

 #100 parallel 'threads'. Want to increase this     

  EM::Iterator.new(0..99, 100).each do |num, iter| 
      start_job_crawl()    
  end
}

分享到QQ

分享到微博