How do I fetch, parse, and crawl data files in Ruby?

I have a number of data files to process from a data warehouse that have the following format:

:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## #####   ########  ####### ###afp##      ##e###")

The data is separated by white spaces and has both numbers and other ASCII chars. Some of those pieces of data will be split up and made more meaningful.

All of the data will go into a database, initially an SQLite db for development, and then be pushed to other, more permanent storage.

These files will actually be pulled in via HTTP from a remote server, and I will have to crawl a bit to get some of them, as they span folders and many files.

I was hoping to get some input on what the best tools and methods may be to accomplish this the "Ruby way", as well as to abstract some of this out. Otherwise, I'll probably tackle it much as I would in Perl or with other such approaches I've taken before.

I was thinking along the lines of using OpenURI to open each URL, then, if the input is HTML, collecting links to crawl; otherwise, processing the data. I would use String#scan to break each file apart into a multi-dimensional array, parsing each component based on the formatting established by the data provider. Upon completion, push the data into the database. Move on to the next input file/URI. Rinse and repeat.
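
For illustration, a rough, untested sketch of that flow might look like the following (the start URL, table name, and three-column layout are placeholders, not the provider's real format):

  require "open-uri"
  require "uri"
  require "sqlite3"

  db = SQLite3::Database.new("dev.sqlite3")
  db.execute("CREATE TABLE IF NOT EXISTS rows (c1 TEXT, c2 TEXT, c3 TEXT)")

  queue = ["http://warehouse.example.com/index.html"] # placeholder start URL

  until queue.empty?
    url  = queue.shift
    body = URI.open(url).read

    if body =~ /<html/i
      # HTML index page: collect further links to crawl (no de-duping or same-site check here)
      queue.concat(URI.extract(body).grep(/\Ahttp/))
    else
      # data file: skip :header and # remark lines, split data rows on whitespace
      body.each_line do |line|
        next if line.start_with?(":", "#") || line.strip.empty?
        fields = line.scan(/\S+/) # per-column String#scan parsing would go here
        db.execute("INSERT INTO rows VALUES (?, ?, ?)", fields[0, 3]) if fields.size >= 3
      end
    end
  end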

I figure I must be missing some libraries that those with more experience would use to clean up and speed up this process dramatically and make the script much more flexible for reuse on other data sets.

Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should be considered too.

Any input on a better approach or libraries to simplify this?

3 Answers

嗫嚅 2024-12-15 22:16:15

Your question focuses a lot on "low level" details -- parsing URLs and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)

My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.

Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking about this, I did a quick search and found a nice tutorial: Scraping a blog with Anemone and MongoDB.
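
To make that concrete, here is a rough, untested sketch (the URL, database, and collection names are placeholders; it assumes the anemone, nokogiri, and mongo gems are installed):

  require "anemone"
  require "mongo"

  client = Mongo::Client.new("mongodb://127.0.0.1:27017/warehouse") # placeholder DB
  pages  = client[:raw_pages]

  Anemone.crawl("http://www.example.com/data/") do |anemone|
    anemone.on_every_page do |page|
      # page.doc is already a Nokogiri document for HTML responses, nil otherwise
      title = page.doc && page.doc.at("title") && page.doc.at("title").text
      pages.insert_one(url: page.url.to_s, title: title, body: page.body)
    end
  end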

装迷糊 2024-12-15 22:16:15

I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.

OpenURI makes it easy to retrieve pages.

URI.extract makes it easy to find links in pages. From the docs:

Description

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

  require "uri"

  URI.extract("text here http://foo.example.org/bla and here mailto:[email protected] and here also.")
  # => ["http://foo.example.com/bla", "mailto:[email protected]"]

Simple, untested, logic to start might look like:

require "openuri"
require "uri"

urls_to_scan = %w[
  http://www.example.com/page1
  http://www.example.com/page2
]

loop do
  break if urls_to_scan.empty?
  url = urls_to_scan.shift
  html = URI.open(url).read # Kernel#open no longer opens URLs on Ruby 3+; URI.open comes from open-uri

  # you probably want to do something to make sure the URLs are not
  # pointing outside the site you're walking.
  #
  # Something like:
  # 
  #     URI.extract(html).select{ |u| u[%r{^http://www\.example\.com}i] }
  #
  new_urls = URI.extract(html)

  if (new_urls.any?)
    urls_to_scan += new_urls
  else
    ; # parse your file as data using the content in html
  end
end

Unless you own the site you're crawling, you want to be kind and gentle: don't run as fast as possible, because it's not your pipe. Pay attention to the site's robots.txt file or risk being banned.

There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.

If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS accessors to search for tags of interest. There are also some pretty powerful gems that make it easy to grab pages, such as typhoeus.
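
For example, pulling just the links to data files out of an index page (reusing the html string from the loop above; the selector and file extension are placeholders) could look like:

  require "nokogiri"

  doc = Nokogiri::HTML(html)
  data_links = doc.css("a[href]")
                  .map { |a| a["href"] }
                  .grep(/\.dat\z/i) # whatever pattern the provider's data files follow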

Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.
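
A minimal Sequel sketch against SQLite (the table and columns are made up for illustration):

  require "sequel"

  DB = Sequel.sqlite("dev.db") # in-memory database if no filename is given

  DB.create_table? :readings do
    primary_key :id
    String :station
    Float  :value
    String :raw_line
  end

  readings = DB[:readings]
  readings.insert(station: "ABC1", value: 42.0, raw_line: "#### ## ## ...")
  puts readings.count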

二手情话 2024-12-15 22:16:15

Hi, I would start by taking a very close look at the gem called Mechanize before firing up any basic open-uri stuff, because that functionality is built into Mechanize. It's a brilliant, fast, and easy-to-use gem for automating web crawling. Since your data format is pretty strange (at least compared to JSON, XML, or HTML), I don't think you will make any use of the built-in parser, but you could still take a look at it. It's called Nokogiri and is extremely smart as well. But in the end, after crawling and fetching the resources, you will probably have to go with some good old regular-expression stuff.
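
A tiny, untested sketch of the Mechanize part (the URL and the link filter are placeholders):

  require "mechanize"

  agent = Mechanize.new
  agent.user_agent_alias = "Mac Safari" # be identifiable and polite

  index = agent.get("http://www.example.com/data/")

  index.links.each do |link|
    next unless link.href =~ /\.txt\z/i # placeholder filter for the data files
    file = link.click # follows the link; non-HTML responses come back as Mechanize::File
    file.body.each_line do |line|
      next if line.start_with?(":", "#")
      fields = line.split # then the good old regular-expression work on each field
      # ... store fields somewhere ...
    end
  end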

Good luck!
