Screen scraping in Rails 3

Posted on 2024-12-28 05:28:50

What screen scraping options (gems or libraries) are available for Rails 3? I have used Nokogiri in the past and would like to know whether there are better options for Rails 3.

Comments (4)

如歌彻婉言 2025-01-04 05:28:50

If this is a one-off task, or if your target data set is relatively small (under a hundred pages), use Mechanize (browse and scrape) or Anemone (does whatever Mechanize does, plus some additional crawling-specific options).

If you need to automate this collection, or if you are dealing with large data sets, consider using a web service. Bobik is a good choice in this category.

晨敛清荷 2025-01-04 05:28:50

Rails doesn't do screen scraping. You are free to add that functionality with Ruby code, but Rails by itself only generates pages.

Mechanize, which uses Nokogiri internally, is a good choice. Otherwise, I always roll my own using Nokogiri and OpenURI.

夜血缘 2025-01-04 05:28:50

On the fantastic RubyTools website you can find several Ruby libraries for parsing HTML. Nokogiri is still the most popular.

傾城如夢未必闌珊 2025-01-04 05:28:50

You can also use the Scrapifier gem to get metadata from URIs found in a string. It's very simple to use:

'Wow! What an awesome site: http://adtangerine.com!'.scrapify

 #=> {
 #   title:       "AdTangerine | Advertising Platform for Social Media",
 #   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
 #   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
 #   uri:         "http://adtangerine.com"
 # }