Does anyone know of a caching plugin for Ruby Mechanize?

Posted 2024-10-31 05:18:39

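I have a Ruby script based on Mechanize that scrapes a website. I'd like to speed it up by caching the downloaded HTML pages locally, so that the whole "tweak output -> run -> tweak output" cycle gets quicker. I'd rather not install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images, and so on.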

Does anyone know of a library that does this? Or another way of achieving the same result (the script running much faster the second time round)?

你是我的挚爱i 2024-11-07 05:18:39

A good way of doing this type of thing is to use the (AWESOME) VCR gem.

Here's an example of how you would do it:

require 'vcr'
require 'mechanize'

# Set up VCR's configuration. The cassette library directory is where
# all of your "recordings" are saved as YAML files.
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock # requires the webmock gem to be installed
end

# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end

As you can see, VCR records the communication as a YAML file on the first run:

mario$  find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml

If you want to have VCR create new versions of the cassettes, just delete the corresponding file.


薄荷→糖丶微凉 2024-11-07 05:18:39

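I'm not sure caching the pages will help that much. What will help more is keeping a record of previously visited URLs so you don't visit them repeatedly. Page caching is moot because you should already have grabbed the important information the first time you saw the page, so all you need to do is check whether you've seen it before. If you have, grab the summary information you care about and do whatever you need with it.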

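I used to write analytics spiders using Perl's Mechanize, which Ruby's Mechanize is based on. Storing previously visited URLs in some sort of in-memory cache, such as a hash, was useful, but whenever the app crashed or the host went down mid-session, all the previous results were lost. At that point a real disk-based database became essential.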

I like Postgres, but SQLite is a good choice too. Whatever you use, get the important information onto the drive, where it can survive restarts and crashes.
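As a rough illustration of the "record visited URLs on disk" idea, here is a minimal sketch that uses a plain append-only text file instead of Postgres or SQLite, purely to keep it dependency-free. The `VisitedLog` class and its method names are made up for this example and are not part of Mechanize; a real spider would probably want a proper database as described above.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper (not part of Mechanize): remembers visited URLs in a
# plain text file, one URL per line, so the record survives restarts.
class VisitedLog
  def initialize(path)
    @path = path
    # Reload whatever an earlier run managed to write before it stopped.
    @seen = File.exist?(path) ? Set.new(File.readlines(path, chomp: true)) : Set.new
  end

  def visited?(url)
    @seen.include?(url.to_s)
  end

  # Records the URL and returns true if it was new, false if already seen.
  def mark(url)
    return false unless @seen.add?(url.to_s)

    # Append and flush right away so a crash loses at most the current URL.
    File.open(@path, 'a') do |f|
      f.puts(url.to_s)
      f.flush
    end
    true
  end
end

# Usage sketch: a second instance stands in for a fresh run after a crash.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'visited.txt')
  log = VisitedLog.new(path)
  log.mark('http://example.com/a')
  restarted = VisitedLog.new(path)
  puts restarted.visited?('http://example.com/a')
end
```

In a scraper loop, the idea would be to call `mark` before fetching and skip the request whenever it returns false.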

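Something else I'd recommend is using a YAML file to configure your app. Put every parameter that is likely to change during a run in there. Then write the app so that it periodically checks the file's modification time and reloads it if it has changed. That way you can adjust its run-time behavior on the fly. Several years ago I had to write a spider to analyze the multiple web sites of a Fortune 50 company. The app ran for three weeks, crawling many different sites tied to that corporation, and because I could tweak the regex that controlled which pages the app processed, I could fine-tune it without shutting it down.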
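The reload-on-change idea could look something like the sketch below. The `ReloadableConfig` class, the `url_pattern` key, and the file name are all assumptions made up for illustration. Instead of polling on a timer, this version checks the modification time on every lookup, which is just one way to do the "periodic" check.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical helper: re-reads a YAML file whenever its modification
# time differs from the one seen at the last load.
class ReloadableConfig
  def initialize(path)
    @path = path
    @mtime = nil
    @data = {}
  end

  # Returns the current value for key, reloading the file first if it changed.
  def [](key)
    reload_if_changed
    @data[key]
  end

  private

  def reload_if_changed
    mtime = File.mtime(@path)
    return if mtime == @mtime

    # An empty file parses to nil, so fall back to an empty hash.
    @data = YAML.safe_load(File.read(@path)) || {}
    @mtime = mtime
  end
end

# Usage sketch: the pattern string comes from the file, so editing the file
# while the spider runs changes which URLs match on the next lookup.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'spider.yml')
  File.write(path, "url_pattern: 'products/\\d+'\n")
  config = ReloadableConfig.new(path)
  pattern = Regexp.new(config['url_pattern'])
  puts pattern.match?('http://example.com/products/42')
end
```

Two edits that land within the filesystem's timestamp resolution would not be noticed by a check like this, so it is only suitable for occasional manual tweaks.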


一紙繁鸢 2024-11-07 05:18:39

If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.

# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]

# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)

I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages. For example:

# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
#    for in-memory storage, you could use a Hash.
#    or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)

  def get_cached(uri)
    cache_key = "_cache/#{uri}"

    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page

    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page

    end
  end

end

Which you could use like this:

require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'

storage = {}

foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")

pp storage

Which prints the following information:

D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
  [#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
   {"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"87",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
   "200"],
 "_cache/http://ifconfig.me/encoding"=>
  [#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
   {"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"42",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "gzip,deflate,identity\n",
   "200"]}
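Since the in-memory Hash above is lost when the script exits, a storage object backed by the filesystem is what actually makes the second run faster. Here is a minimal sketch of one. `FileStore` is a hypothetical name, and it assumes that every stored value can be serialized with Marshal and that the cache directory only ever contains files the script wrote itself (Marshal.load should not be used on untrusted data).

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical disk-backed store: responds to [] and []= like a Hash, and
# serializes values with Marshal so arbitrary Ruby objects can round-trip.
class FileStore
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  # Returns the stored object, or nil when the key has never been written.
  def [](key)
    path = path_for(key)
    return nil unless File.exist?(path)

    Marshal.load(File.binread(path))
  end

  def []=(key, value)
    File.binwrite(path_for(key), Marshal.dump(value))
  end

  private

  # Cache keys contain slashes and colons, so hash them into safe file names.
  def path_for(key)
    File.join(@dir, Digest::SHA256.hexdigest(key.to_s))
  end
end

# Usage sketch with a throwaway directory; a real run would use a fixed path.
Dir.mktmpdir do |dir|
  storage = FileStore.new(dir)
  storage['_cache/http://example.com/'] = ['http://example.com/', {}, "hello\n", '200']
  p storage['_cache/http://example.com/']
end
```

An instance pointed at a persistent directory could then be passed as the storage argument in place of the Hash, for example `Foobar.new(Mechanize.new, FileStore.new('page_cache'), Logger.new(STDOUT))`, where `page_cache` is just an example directory name.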
