Rails rake task keeps eating more RAM while running (crawling sites with mechanize)
I use the mechanize gem to crawl websites. I wrote a very simple, single-threaded crawler inside a Rails rake task because I need to access Rails models.
The crawler runs just fine, but after watching it run for a while I can see that it eats more and more RAM over time, which is bad.
I use the God gem to monitor my crawler.
Below is my rake task code. I'm wondering whether it exposes any possibility of a memory leak?
task :abc => :environment do
  prefix_url = 'http://example.com/abc-'
  postfix_url = '.html'
  from_page_id = (AppConfig.last_crawled_id || 1) + 1
  to_page_id = 100000

  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'

  (from_page_id..to_page_id).each do |i|
    url = "#{prefix_url}#{i}#{postfix_url}"
    puts "#{Time.now} - Crawl #{url}"
    page = agent.get(url)
    page.search('#content > ul').each do |s|
      var = s.css('li')[0].text()
      value = s.css('li')[1].text()
      MyModel.create :var => var, :value => value
    end
    AppConfig.last_crawled_id = i
  end

  # Finished crawling, stop the process
  `god stop crawl_abc`
end
1 Answer
Unless you've got the very latest version of mechanize (2.1.1 was released only a day or so ago), by default mechanize operates with an unlimited history size, i.e. it keeps every page you have visited and so gradually uses more and more memory.
In your case there isn't any point to this, so calling max_history= on your agent should limit how much memory is used in this fashion.
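To see why an unlimited history grows without bound, here is a minimal self-contained sketch of the behaviour (the `BoundedHistory` class and its names are hypothetical, purely for illustration; the actual fix in the crawler is a single line, `agent.max_history = 1`, right after `Mechanize.new`):

```ruby
# Hypothetical stand-in for mechanize's page history: with no cap the
# array of visited pages grows on every fetch; with a cap, older entries
# are dropped so they can be garbage-collected.
class BoundedHistory
  attr_reader :pages

  def initialize(max_size = nil)
    @max_size = max_size # nil mimics an unlimited history
    @pages = []
  end

  def push(page)
    @pages << page
    # Drop the oldest entries once the cap is exceeded.
    @pages.shift while @max_size && @pages.size > @max_size
    page
  end
end

unlimited = BoundedHistory.new
capped    = BoundedHistory.new(1)

1_000.times do |i|
  unlimited.push("page-#{i}")
  capped.push("page-#{i}")
end

puts unlimited.pages.size # 1000 -- every visited page is still retained
puts capped.pages.size    # 1    -- only the most recent page is retained
```

For a crawler like yours, which never revisits pages, capping the history to 1 (or 0) lets each fetched page be reclaimed by the garbage collector as soon as the next one is loaded.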