How to crawl, parse, and scrape files in Ruby?
I have a number of data files to process from a data warehouse that have the following format:
:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## ##### ######## ####### ###afp## ##e###")
The data is separated by whitespace and contains both numbers and other ASCII characters. Some of those pieces of data will be split up and made more meaningful.
All of the data will go into a database, initially a SQLite db for development, and then be pushed up to other, more permanent storage.
These files will actually be pulled in via HTTP from the remote server, and I will have to crawl a bit to get some of them, as they span folders and many files.
I was hoping to get some input on the best tools and methods for accomplishing this the "Ruby way", and for abstracting some of it out. Otherwise, I'll probably tackle it much as I would in Perl or with other such approaches I've taken before.
I was thinking along the lines of using OpenURI to open each URL, then, if the input is HTML, collecting links to crawl, and otherwise processing the data. I would use String.scan to break each file apart into a multi-dimensional array, parsing each component according to the formatting established by the data provider. Upon completion, push the data into the database, move on to the next input file/URI, rinse, and repeat.
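As a rough sketch of that flow in Ruby (the start URL, the HTML check, and the way fields are split are all illustrative assumptions, not the provider's actual format):

```ruby
require "open-uri"

# Fetch one resource; either queue up links from an index page
# or break a data file into rows of whitespace-separated fields.
def handle(url, queue, rows)
  body = URI.open(url).read

  if body.include?("<html")                              # crude "is this an index page?" test
    queue.concat(body.scan(/href="([^"]+)"/).flatten)
  else
    body.each_line do |line|
      next if line.start_with?(":", "#")                 # skip headers and remarks
      rows << line.scan(/\S+/)                           # String.scan per field
    end
  end
end

queue = ["http://example.com/warehouse/index.html"]      # placeholder start URL
rows  = []                                               # stand-in for the real DB insert
handle(queue.shift, queue, rows) until queue.empty?
```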
I figure I must be missing some libraries that those with more experience would use to clean up and speed up this process dramatically, and to make the script much more flexible for reuse on other data sets.
Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should be considered too.
Any input on a better approach or libraries to simplify this?
3 Answers
Your question focuses a lot on "low-level" details -- parsing URLs and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking about this, I did a quick search and found a nice tutorial: Scraping a blog with Anemone and MongoDB.
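A minimal sketch of that combination might look like the following (the start URL, database name, and stored fields are assumptions; the calls are the commonly documented APIs of the anemone, nokogiri, and mongo gems):

```ruby
require "anemone"
require "nokogiri"
require "mongo"

client = Mongo::Client.new(["127.0.0.1:27017"], database: "warehouse_crawl")
pages  = client[:pages]

# Crawl the site, parse each page with Nokogiri, and keep the raw body
# so later analyses don't require re-crawling.
Anemone.crawl("http://example.com/warehouse/") do |anemone|
  anemone.on_every_page do |page|
    doc = Nokogiri::HTML(page.body)
    pages.insert_one(
      url:   page.url.to_s,
      title: doc.at("title")&.text,
      body:  page.body
    )
  end
end
```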
I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.
OpenURI makes it easy to retrieve pages, and URI.extract makes it easy to find links in pages. Simple, untested logic to start might look like the sketch below.
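As one possible version of that starting logic (the start URL, the page limit, and the `process_data` helper are placeholder assumptions; the URI.extract call mirrors the example in the Ruby standard-library docs):

```ruby
require "open-uri"
require "uri"

# From the URI.extract docs:
#   URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
#   # => ["http://foo.example.org/bla", "mailto:test@example.com"]

queue     = ["http://example.com/data/"]   # placeholder starting point
visited   = {}
max_pages = 100                            # arbitrary safety limit

until queue.empty? || visited.size >= max_pages
  url = queue.shift
  next if visited[url]
  visited[url] = true

  body = URI.open(url).read

  if body =~ /<html/i
    # Index page: harvest links and keep crawling.
    URI.extract(body, %w[http https]).each { |link| queue << link }
  else
    # Data file: hand it off for parsing (hypothetical helper).
    process_data(url, body)
  end

  sleep 1   # be gentle with the remote server
end
```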
Unless you own the site you're crawling, you want to be kind and gentle: don't run as fast as possible, because it's not your pipe. Pay attention to the site's robots.txt file or risk being banned.

There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.
If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS accessors to search for tags of interest. There are also some pretty powerful gems for making it easy to grab pages, such as Typhoeus.
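For example, a quick Nokogiri sketch using CSS selectors (the URL, selector, and file extensions are assumptions about what an index page might contain):

```ruby
require "open-uri"
require "nokogiri"

# Parse an index page and pull out hrefs that look like data files.
doc = Nokogiri::HTML(URI.open("http://example.com/data/index.html"))
data_links = doc.css("a[href]")
               .map { |a| a["href"] }
               .select { |href| href.end_with?(".dat", ".txt") }   # assumed extensions

puts data_links
```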
Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.
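A small Sequel sketch against the development SQLite database (the table and column names are made up; adapt them to whatever your parsed rows actually contain):

```ruby
require "sequel"

# Connect to a local SQLite file; swap the connection string for the permanent store later.
DB = Sequel.sqlite("warehouse_dev.db")

DB.create_table? :readings do
  primary_key :id
  String   :source_url
  String   :station        # assumed field
  Float    :value          # assumed field
  DateTime :recorded_at
end

readings = DB[:readings]
readings.insert(source_url:  "http://example.com/data/file1",
                station:     "afp",
                value:       42.0,
                recorded_at: Time.now)

puts readings.count
```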
Hi, I would start by taking a very close look at the gem called Mechanize before firing up any basic open-uri stuff, since that functionality is built into Mechanize. It's a brilliant, fast, and easy-to-use gem for automating web crawling. Since your data format is pretty strange (at least compared to JSON, XML, or HTML), I don't think you will make any use of the built-in parser, but you could still take a look at it. It's called Nokogiri and is extremely smart as well. But in the end, after crawling and fetching the resources, you will probably have to go with some good old regular-expression stuff.
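A short Mechanize sketch of that idea (the URL, the link filter, and the regex handling of data rows are placeholder assumptions):

```ruby
require "mechanize"

agent = Mechanize.new
agent.user_agent_alias = "Mac Safari"        # identify as a normal browser

index = agent.get("http://example.com/data/")

# Follow every link that looks like a raw data file and pick its rows apart.
index.links.select { |l| l.href =~ /\.(dat|txt)\z/ }.each do |link|
  file = agent.get(link.href)
  file.body.each_line do |line|
    next if line.start_with?(":", "#")       # skip headers and remarks
    fields = line.scan(/\S+/)                # good old regular expressions
    # ... store or further split `fields` here ...
  end
end
```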
Good luck!