有人知道 Ruby Mechanize 的缓存插件吗?

发布于 2024-10-31 05:18:39 字数 219 浏览 1 评论 0原文

我有一个基于 Mechanize 的 Ruby 脚本来抓取网站。我希望通过在本地缓存下载的 HTML 页面来加快速度,以使整个“调整输出 -> 运行 -> 调整输出”循环更快。我不希望仅仅为了这个脚本而在计算机上安装外部缓存。理想的解决方案将插件到 Mechanize 并透明地缓存获取的页面、图像等。

有人知道有一个图书馆可以做到这一点吗?或者实现相同结果的另一种方法(脚本第二轮运行得更快)?

I have a Mechanize based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plugin to Mechanize and transparently cache fetched pages, images and so on.

Anyone know of a library that will do this? Or another way of achieving the same outcome (script runs much quicker second time round)?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

你是我的挚爱i 2024-11-07 05:18:39

做这种事情的一个好方法是使用(很棒的)VCR gem

下面是如何执行此操作的示例:

require 'vcr'
require 'mechanize'

# Setup VCR's configs.  The cassette library directory is where 
# all of your "recordings" are saved as YAML files.  
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end

# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end

如您所见... VCR 在第一次运行时将通信记录为 YAML 文件:

mario$  find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml

如果您想让 VCR 创建磁带的新版本,只需删除相应的文件即可。

A good way of doing this type of thing is to use the (AWESOME) VCR gem.

Here's an example of how you would do it:

require 'vcr'
require 'mechanize'

# Setup VCR's configs.  The cassette library directory is where 
# all of your "recordings" are saved as YAML files.  
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end

# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end

As you can see... VCR records the communication as a YAML file on the first run:

mario$  find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml

If you want to have VCR create new versions of the cassettes, just delete the corresponding file.

薄荷→糖丶微凉 2024-11-07 05:18:39

我不确定缓存页面是否会有那么大的帮助。更有帮助的是记录以前访问过的 URL,这样您就不会重复访问它们。页面缓存没有实际意义,因为当您第一次看到该页面时,您应该已经获取了重要信息,因此您需要做的就是检查是否已经看过它。如果有,请获取您关心的摘要信息并根据需要对其进行操作。

我曾经使用 Perl 的 Mechanize 编写分析蜘蛛。 Ruby 的 Mechanize 就是基于它的。将以前访问过的 URL 存储在某种类型的缓存中是有用的,就像哈希一样,但是,由于应用程序崩溃或主机在会话中途宕机,以前的所有结果都会消失。那时,一个真正的基于磁盘的数据库是必不可少的。

我喜欢 Postgres,但 SQLite 也是一个不错的选择。无论您使用什么,都可以获取有关驱动器的重要信息,使其能够在重新启动或崩溃后幸存下来。

我建议的其他内容是使用 YAML 文件来配置您的应用程序。将应用程序运行期间可能更改的每个参数放在那里。然后,编写应用程序,以便它定期检查该文件的修改时间,并在发生更改时重新加载它。这样,您就可以动态调整其运行时行为。几年前,我不得不编写一个蜘蛛程序来分析一家财富 50 强公司的多个网站。该应用程序运行了三周,抓取了与该公司相关的许多不同网站,并且因为我可以调整用于控制应用程序处理的页面的正则表达式,所以我可以在不关闭该应用程序的情况下对其进行微调。

I'm not sure that caching the pages is going to help that much. What will help more is to have a record of previously visited URLs so you don't revisit them repeatedly. The page caching is moot because you should have already grabbed the important information when you saw the page the first time so all you need to do is check to see if you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.

I used to write analytical spiders using Perl's Mechanize. Ruby's Mechanize is based on it. Storing the previously visited URLs in SOME sort of cache was useful, like a hash, but, because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.

I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.

Something else I'd recommend, is use a YAML file for configuration of your app. Put every parameter that is likely to be changed during the app's run in there. Then, write the app so it periodically checks that file's modification time and reloads it if there's been a change. That way, you can adjust its run-time behavior on the fly. I had to write a spider to analyze a Fortune 50 corporation's multiple-websites several years ago. The app ran for three weeks spidering many different sites tied to that corporation, and because I could tweak the regex used to control which pages the app processed, I could fine tune it without shutting down that app.

一紙繁鸢 2024-11-07 05:18:39

如果您在第一次请求后存储了有关页面的一些信息,则可以稍后重建页面,而无需从服务器重新请求它。

# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]

# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)

我在蜘蛛/抓取器中使用了这种技术,这样就可以调整代码而不必重新请求所有页面。例如:

# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
#    for in-memory storage, you could use a Hash.
#    or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)

  def get_cached(uri)
    cache_key = "_cache/#{uri}"

    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page

    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page

    end
  end

end

您可以像这样使用它:

require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'

storage = {}

foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")

pp storage

它打印以下信息:

D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
  [#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
   {"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"87",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
   "200"],
 "_cache/http://ifconfig.me/encoding"=>
  [#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
   {"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"42",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "gzip,deflate,identity\n",
   "200"]}

If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.

# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]

# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)

I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages. e.g.:

# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
#    for in-memory storage, you could use a Hash.
#    or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)

  def get_cached(uri)
    cache_key = "_cache/#{uri}"

    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page

    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page

    end
  end

end

Which you could use like this:

require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'

storage = {}

foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")

pp storage

Which prints the following information:

D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
  [#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
   {"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"87",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
   "200"],
 "_cache/http://ifconfig.me/encoding"=>
  [#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
   {"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"42",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "gzip,deflate,identity\n",
   "200"]}
裸钻 2024-11-07 05:18:39

将页面写入文件,将每个页面写入一个单独的文件,并将调整和运行周期分开怎么样?

How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文