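EventMachine: what is the maximum number of parallel HTTP requests EM can handle?
I'm building a distributed web crawler and trying to get the most out of each machine's resources. I run parsing functions in EventMachine through an Iterator and use em-http-request to make asynchronous HTTP requests. For now I have 100 iterations running at the same time, and it seems I can't get past this level: increasing the number of iterations doesn't affect the crawling speed. However, I only see 10-15% CPU load and 20-30% network load, so there's plenty of room to crawl faster.
I'm using Ruby 1.9.2. Is there any way to improve the code so it uses resources effectively, or am I doing it wrong altogether?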
def start_job_crawl
  @redis.lpop @queue do |link|
    if link.nil?
      # queue is empty: retry in one second
      EventMachine::add_timer( 1 ){ start_job_crawl() }
    else
      # parsing link, using asynchronous http request,
      # doing something with the content
      parse(link)
    end
  end
end

# main reactor loop
EM.run {
  EM.kqueue
  @redis = EM::Protocols::Redis.connect(:host => "127.0.0.1")
  @redis.errback do |code|
    puts "Redis error: #{code}"
  end
  # 100 parallel 'threads'. Want to increase this
  EM::Iterator.new(0..99, 100).each do |num, iter|
    start_job_crawl()
  end
}
Comments (2)
If you are using select() (which is the default for EM), the maximum is 1024, because select() is limited to 1024 file descriptors.
However, it looks like you are using kqueue, so it should be able to handle many more than 1024 file descriptors at once.
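Whichever poller is in use, the operating system's per-process descriptor limit still caps the number of open sockets. Here is a minimal sketch, using only the Ruby standard library, for checking and raising that limit on a POSIX system. The helper names and the 4096 target are illustrative assumptions, not something from the question.

```ruby
# Report how many file descriptors this process may open.
# Each in-flight HTTP request holds at least one socket, so the soft
# limit bounds concurrency regardless of select/kqueue/epoll.
def descriptor_limits
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  { soft: soft, hard: hard }
end

# Try to raise the soft limit to `wanted` (hypothetical target), capped
# at the hard limit. Returns the soft limit in effect afterwards.
def raise_descriptor_limit(wanted)
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  target = [wanted, hard].min
  Process.setrlimit(Process::RLIMIT_NOFILE, target, hard) if target > soft
  Process.getrlimit(Process::RLIMIT_NOFILE).first
end

limits = descriptor_limits
puts "soft=#{limits[:soft]} hard=#{limits[:hard]}"
puts "soft limit after raise attempt: #{raise_descriptor_limit(4096)}"
```

EventMachine offers EM.set_descriptor_table_size(n) for the same adjustment. As far as I know, EM.kqueue only takes effect when called before EM.run, so it may be worth confirming that the reactor is not still falling back to select().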
What is the value of your EM.threadpool_size?
Try enlarging it; I suspect the limit is not in kqueue but in the pool handling the requests...