Rake tasks for scraping with Rails
I'm starting to write scrapers to get data from different websites. I built the first scraper in a rake file and am now starting to write a second rake file to get data from a second site. For now, I am writing a scraper specific to each site I'm interested in (not trying to build a generic scraper).
I have 3 questions:
1. Is writing rake tasks a good choice for me? Are there alternatives I should consider?
2. How can I add functions/methods to my rake files? (Sorry, very silly question, but I can't figure out how to structure my code, so for now it's just 500 lines of uninterrupted code in one long method.) For instance, I'd like a "get_description(section)" method that returns the description from the page. The method could be different depending on which site I'm scraping.
3. How can I test my task with RSpec? I'd like to give it a link and make sure the output of my task matches what I expect to get.
Thanks for your help!
Comments (2)
As a general principle, rake tasks should be very minimal. Delegate the actual behavior to real classes, which can then be tested easily.
Example:
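A minimal sketch of what this could look like. It is an illustration only: `ExampleSiteScraper`, the `scrape:example_site` task and the injected `fetcher` are hypothetical names, and the regex stands in for a real HTML parser such as Nokogiri just to keep the sketch dependency-free.

```ruby
require "rake" # not needed inside an actual Rakefile; here so the sketch runs standalone

# Hypothetical scraper class; in a Rails app it could live under lib/.
class ExampleSiteScraper
  # fetcher: any object responding to #call(url) and returning an HTML string.
  # Injecting it keeps the class testable without network access.
  def initialize(fetcher)
    @fetcher = fetcher
  end

  # Returns the page title, or nil when the page has none.
  def title(url)
    html = @fetcher.call(url)
    match = html.match(%r{<title>(.*?)</title>}m)
    match && match[1].strip
  end
end

# The task body stays tiny: build the object, call it, print the result.
namespace :scrape do
  desc "Print the title of the page at the given URL"
  task :example_site, [:url] do |_task, args|
    require "open-uri"
    fetcher = ->(url) { URI.open(url).read }
    puts ExampleSiteScraper.new(fetcher).title(args[:url])
  end
end
```

With something like this in place, the task would be run as `rake "scrape:example_site[http://example.com]"`, and all the logic worth testing sits in the class rather than in the task.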
As @brad indicated, you could also use thor, which has a regular class structure of its own, so in theory the tasks themselves should be easier to test. I haven't done that, though.
You can define methods in a rake file, but I don't know where they end up. You shouldn't do that, so don't bother. Keep task bodies minimal and write normal code to do the dirty work.
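For the "get_description(section)" case from the question, the "normal code" could be one small class per site sharing a base class. This is only a sketch under assumptions: the class names and the markup each site is supposed to use are made up for illustration, and a section is treated as a plain string.

```ruby
# Shared behaviour lives in a base class (all names here are hypothetical).
class BaseScraper
  # Each site-specific subclass says how to pull a description out of a section.
  def get_description(section)
    raise NotImplementedError, "#{self.class} must implement get_description"
  end

  # Common helper: apply the site-specific extraction to every section.
  def descriptions(sections)
    sections.map { |section| get_description(section) }
  end
end

# Site A is assumed to wrap descriptions in <p class="desc">...</p>.
class SiteAScraper < BaseScraper
  def get_description(section)
    section[%r{<p class="desc">(.*?)</p>}m, 1].to_s.strip
  end
end

# Site B is assumed to put the description after a "Description:" label.
class SiteBScraper < BaseScraper
  def get_description(section)
    section[/^Description:\s*(.*)$/, 1].to_s.strip
  end
end
```

Each rake task then only has to pick the right class, and the 500-line method can be split into small methods on these classes.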
Sure, rake is fine if you want to use it. You can also check out thor, which uses more standard Ruby-like syntax rather than the DSL that rake provides.
Rake is just another Ruby library, so you can include whatever you like in there. As such, you can write your own library and load it in your rake file. Check out how Bundler does it, for instance: they've just defined their own classes, then created tasks inside them. It uses thor, by the way, which, from what I can gather, somehow proxies those tasks on to rake. I haven't really looked through it thoroughly, though, so I could be wrong.
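A rough sketch of that pattern, loosely modelled on how `Bundler::GemHelper` includes `Rake::DSL` and defines its tasks from an instance method. `ScraperTasks` and the `run` method expected on each scraper are assumptions made for this example, not Bundler's code.

```ruby
require "rake" # not needed inside an actual Rakefile

# Hypothetical library class that registers its own rake tasks.
class ScraperTasks
  include Rake::DSL # gives instances access to namespace, desc and task

  # scrapers: hash of task name => object responding to #run.
  def initialize(scrapers)
    @scrapers = scrapers
  end

  # Defines one scrape:<name> task per registered scraper.
  def install
    namespace :scrape do
      @scrapers.each do |name, scraper|
        desc "Run the #{name} scraper"
        task(name) { scraper.run }
      end
    end
  end
end
```

The Rakefile then shrinks to loading the library and calling something like `ScraperTasks.new(site_a: some_scraper).install`, where `some_scraper` is whatever object does the real work.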
If you're defining things in your own library, just use RSpec as you normally would for any other project, then hook that library into rake or thor by whatever means, and you're off to the races.