How to run multiple Nokogiri screen-scraping threads at once
I have a website that uses Nokogiri to extract data from many different websites. The process runs as a background job using the delayed_job gem. However, it takes around 3-4 seconds per page, because each request has to pause and wait for the other website to respond.
Currently I just run them one after another, basically like this:
Websites.all.each do |website|
# screen scrape
end
I would like to execute them in batches rather than one at a time, so that I don't have to wait for each site's server to respond in turn (a response can occasionally take up to 20 seconds).
What would be the best Ruby or Rails way to do this?
Thanks for your help in advance.
Comments (3)
You might want to check out Typhoeus, which lets you make parallel HTTP requests.
I found a short blog post about using it with Nokogiri, but I haven't tried it myself.
Wrapped in a delayed_job (DJ) job, this should do the trick with little client-side latency.
You need to use delayed_job. Check out this Railscasts episode.
Keep in mind that most hosts charge for this type of thing.
If you would rather not manage threads yourself, you can also use the spawn plugin, which is much, much easier.
This is literally all you need to do:
rails plugin install https://github.com/tra/spawn.git
For example:
http://railscasts.com/episodes/171-delayed-job
https://github.com/tra/spawn
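If you do want to manage the threads yourself, here is a minimal sketch of the batching idea using only the Ruby standard library. It is not the spawn plugin's API, and it is not something from the question: the names fetch_in_batches, DEFAULT_FETCHER, batch_size and fetcher are made up for illustration. Each batch starts one thread per URL, so the waits on slow servers overlap instead of adding up.

```ruby
require "net/http"
require "uri"

# Hypothetical default fetcher: a plain blocking GET that returns the body.
DEFAULT_FETCHER = ->(url) { Net::HTTP.get(URI(url)) }

# Hypothetical helper: fetches every URL with at most `batch_size`
# requests in flight at once. Returns a Hash of url => body, or
# url => the exception that was raised while fetching that URL.
def fetch_in_batches(urls, batch_size: 10, fetcher: DEFAULT_FETCHER)
  results = {}
  urls.each_slice(batch_size) do |batch|
    # One thread per URL in the current batch.
    threads = batch.map do |url|
      Thread.new do
        begin
          [url, fetcher.call(url)]
        rescue StandardError => e
          # Keep going if one site fails; record the error instead.
          [url, e]
        end
      end
    end
    # Thread#value waits for the thread and returns the block's result.
    threads.each do |thread|
      url, body = thread.value
      results[url] = body
    end
  end
  results
end
```

Inside the delayed_job you could then call something like fetch_in_batches(urls, batch_size: 10), skip the entries that are exceptions, and pass each body to Nokogiri::HTML(body) for the actual scraping. Parsing happens after the threads have finished, so only the network waits run concurrently. How the URLs are read from each website record is left open here, since the question does not show that part.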
I'm using EventMachine to do something similar for a current project. There is a terrific library called em-http-request that lets you make multiple HTTP requests in parallel, and it also provides options for synchronising the responses.
From the em-http-request GitHub docs:
So in your case, you could have
Run your Rails application with the thin web server in order to get a functioning EventMachine loop:
You'll also need the eventmachine and em-http-request gems. Good luck!