How to run multiple Nokogiri screen-scraping threads at the same time

Posted 2024-10-25 00:04:30

I have a site that uses Nokogiri to extract data from many different websites. The process runs as a background job using the delayed_job gem. However, each page takes around 3-4 seconds, because the job has to pause and wait for the other website to respond.
At the moment I simply run them one after another, basically like this:

Websites.all.each do |website|
  # screen scrape
end

I would like to execute them in batches rather than one at a time, so that I don't have to wait for each site's server to respond before starting the next request (a response can occasionally take up to 20 seconds).

What would be the best ruby or rails way to do this?

Thanks for your help in advance.

Comments (3)

跨年 2024-11-01 00:04:30

You might want to check out Typhoeus, which lets you make parallel HTTP requests.

I found a short blog post about using it with Nokogiri, but I haven't tried it myself.

Wrapped in a delayed_job (DJ) job, this should do the trick with little client-side latency.

貪欢 2024-11-01 00:04:30

You need to use delayed_job. Check out this Railscasts episode.

Keep in mind that most hosts charge for this kind of background processing.

You can also use the spawn plugin if you don't want to manage threads yourself; it is much easier.

This is literally all you need to do:

  1. rails plugin install https://github.com/tra/spawn.git
  2. Then add the method call in your controller or model

For example:

  spawn do
    # execute your code here :)
  end

http://railscasts.com/episodes/171-delayed-job

https://github.com/tra/spawn
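
If you would rather manage the threads yourself, the batching the question asks for can be sketched with nothing but Ruby's standard library. This is only a sketch under assumptions: scrape_in_batches and fetch_page are hypothetical names, the batch size of 10 and the timeouts are arbitrary, and it uses plain Thread objects rather than the spawn plugin.

```ruby
require "net/http"
require "uri"

# Hypothetical default fetcher: a plain GET with timeouts, so one slow
# site cannot hold up its batch indefinitely.
def fetch_page(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: 5, read_timeout: 20) do |http|
    http.get(uri.request_uri).body
  end
end

# Fetches the URLs in batches of batch_size, one thread per URL in a batch.
# Returns { url => body }; the body is nil when the fetch raised an error.
def scrape_in_batches(urls, batch_size: 10, fetcher: method(:fetch_page))
  results = {}
  urls.each_slice(batch_size) do |batch|
    threads = batch.map do |url|
      Thread.new do
        body = begin
          fetcher.call(url)
        rescue StandardError
          nil
        end
        [url, body]
      end
    end
    # Thread#value waits for the thread to finish and returns its result.
    threads.each do |thread|
      url, body = thread.value
      results[url] = body
    end
  end
  results
end
```

Each batch takes roughly as long as its slowest site instead of the sum of all of them. Parsing the returned bodies with Nokogiri and saving records is probably best done after the threads have finished, so that database access stays on the job's main thread.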

桜花祭 2024-11-01 00:04:30

I'm using EventMachine to do something similar for a current project. There is a terrific library called em-http-request that lets you make multiple HTTP requests in parallel and also provides options for synchronising the responses.

From the em-http-request GitHub docs:

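require 'eventmachine'
require 'em-http-request'

EventMachine.run {
  # Both requests are started immediately; neither call blocks
  http1 = EventMachine::HttpRequest.new('http://google.com/').get
  http2 = EventMachine::HttpRequest.new('http://yahoo.com/').get

  # Each callback runs once its response has arrived (body in http1.response)
  http1.callback { }
  http2.callback { }
}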

So in your case, you could have

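# Assumes an EventMachine reactor is already running
# Start every request first; none of these calls block
requests = Websites.all.map do |website|
  EventMachine::HttpRequest.new(website.url).get
end

requests.each do |http|
  # Runs when this site's response arrives; parse http.response with Nokogiri here
  http.callback { }
  # Runs if the connection fails or times out
  http.errback { }
end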

Run your Rails application with the Thin web server in order to get a functioning EventMachine loop:

bundle exec rails server thin

You'll also need the eventmachine and em-http-request gems. Good luck!
