Rails rake task consumes more and more RAM as it runs (crawling a website with mechanize)

Posted 2025-01-03 01:44:07

I use the mechanize gem to crawl websites. I wrote a very simple, single-threaded crawler inside a Rails rake task because I needed access to Rails models.

The crawler runs just fine, but after watching it run for a while I can see that it consumes more and more RAM over time, which is bad.

I use the God gem to monitor my crawler.

Below is my rake task code; I'm wondering whether it exposes any possibility of a memory leak.

task :abc => :environment do
  prefix_url = 'http://example.com/abc-'
  postfix_url = '.html'
  # Resume from the page after the last one we crawled
  from_page_id = (AppConfig.last_crawled_id || 1) + 1
  to_page_id = 100000

  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'

  (from_page_id..to_page_id).each do |i|
    url = "#{prefix_url}#{i}#{postfix_url}"
    puts "#{Time.now} - Crawl #{url}"
    page = agent.get(url)

    # Each matching <ul> carries a name/value pair in its first two <li>s
    page.search('#content > ul').each do |s|
      var = s.css('li')[0].text
      value = s.css('li')[1].text
      MyModel.create :var => var, :value => value
    end

    AppConfig.last_crawled_id = i
  end
  # Finished crawling, stop the God-monitored process
  `god stop crawl_abc`
end

Comments (1)

尬尬 2025-01-10 01:44:07

Unless you've got the very latest version of mechanize (2.1.1 was released only a day or so ago), by default mechanize operates with an unlimited history size, i.e. it keeps every page you have visited, and so it will gradually use more and more memory.

In your case there isn't any point to this, so calling max_history= on your agent should limit how much memory is used in this fashion.
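
For illustration, a minimal sketch of that fix applied to the agent setup from the question; the cap of 1 is an arbitrary choice, since this crawler never navigates back to previously visited pages:

require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
# Cap the page history so old pages can be garbage-collected
# instead of accumulating in agent.history for the whole run.
agent.max_history = 1

With this in place the long (from_page_id..to_page_id) loop should hold at most one fetched page in the agent's history at a time, rather than all of them.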
