Why is curl in Ruby slower than command-line curl?
I am trying to download more than 1M pages (URLs ending in a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:
require 'curb'

curl = Curl::Easy.new  # one handle reused for the whole batch
batch_urls.each { |url_info|
  curl.url = url_info[:url]
  curl.perform                # blocking: one request at a time
  # body_str holds the whole response buffered in memory
  File.open(url_info[:file], "wb") { |file| file << curl.body_str }
  # ... some other stuff
}
I have tried downloading a sample of 8000 pages. When using the code above, I get 1000 pages in 2 minutes. When I write all the URLs into a file and run this in a shell:
cat list | xargs curl
I get all 8000 pages in two minutes.
The thing is, I need to have it in Ruby code, because there is other monitoring and processing code.
I have tried:
- Curl::Multi - it is somewhat faster, but it misses 50-90% of the files (it does not download them and gives no reason/code)
- multiple threads with Curl::Easy - around the same speed as single-threaded
Why is the reused Curl::Easy slower than successive command-line curl calls, and how can I make it faster? Or what am I doing wrong?
I would prefer to fix my download manager code rather than handle downloading for this case in a different way.
Before this, I was calling command-line wget, which I provided with a file containing the list of URLs. However, not all errors were handled, and it was not possible to specify a separate output file for each URL when using a URL list.
Now it seems to me that the best way would be to use multiple threads with system calls to the 'curl' command. But why do that when I can use Curl directly in Ruby?
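For illustration, a rough sketch of that thread-plus-system-curl approach (the 20-thread worker pool is an arbitrary number to tune; Queue comes from Ruby's standard thread library):

require 'thread'

queue = Queue.new
batch_urls.each { |url_info| queue << url_info }

threads = Array.new(20) do
  Thread.new do
    while (url_info = (queue.pop(true) rescue nil))
      # Each thread blocks in its own curl process, so the interpreter
      # lock does not serialize the actual downloads.
      system("curl", "-s", "-o", url_info[:file], url_info[:url])
    end
  end
end
threads.each(&:join)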
Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not setting them at all to various values; it did not seem to help).
Any hints appreciated.
6 Answers
This could be a fitting task for Typhoeus.
Something like this (untested):
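(A sketch assuming Typhoeus's Hydra API; the max_concurrency of 20 is an arbitrary choice, and batch_urls is the structure from the question.)

require 'typhoeus'

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

batch_urls.each do |url_info|
  request = Typhoeus::Request.new(url_info[:url])
  request.on_complete do |response|
    # One output file per URL, as in the question
    File.open(url_info[:file], "wb") { |f| f.write(response.body) }
  end
  hydra.queue(request)
end

hydra.run  # blocks until every queued request has finished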
Come to think of it, you might get a memory problem because of the enormous number of files. One way to prevent that would be to never store the data in a variable but instead stream it to the file directly. You could use em-http-request for that.
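A sketch of that streaming idea with em-http-request (this fires all requests inside one EventMachine loop; real code would throttle how many are in flight):

require 'em-http-request'

EventMachine.run do
  pending = batch_urls.size
  batch_urls.each do |url_info|
    file = File.open(url_info[:file], "wb")
    http = EventMachine::HttpRequest.new(url_info[:url]).get
    # Write each chunk to disk as it arrives instead of buffering the body
    http.stream { |chunk| file.write(chunk) }
    finish = proc do
      file.close
      pending -= 1
      EventMachine.stop if pending.zero?
    end
    http.callback(&finish)
    http.errback(&finish)
  end
end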
So, if you don't set an on_body handler, curb will buffer the download. If you're downloading files, you should use an on_body handler. If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface.
If you just want to download a single file, you can use Curl::Easy.download.
Here is a good resource: http://gist.github.com/405779
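Applied to the question's loop, the on_body version might look like this (a sketch; File#write returns the number of bytes written, which is what curb expects the handler to return):

require 'curb'

curl = Curl::Easy.new
batch_urls.each do |url_info|
  File.open(url_info[:file], "wb") do |file|
    curl.url = url_info[:url]
    # Stream each chunk straight to disk instead of buffering it in body_str
    curl.on_body { |chunk| file.write(chunk) }
    curl.perform
  end
end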
There have been benchmarks comparing curb with other methods such as HTTPClient. The winner, in almost all categories, was HTTPClient. Plus, there are some documented scenarios where curb does NOT work in multi-threaded scenarios.
Like you, I've had the same experience. I ran curl system commands in 20+ concurrent threads and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.
I've since switched to HTTPClient, and the difference is huge. Now it runs as fast as 20 concurrent curl system commands, and uses less CPU as well.
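A sketch of the HTTPClient equivalent (the 20-thread pool mirrors the comparison above; a single HTTPClient instance can be shared across threads):

require 'httpclient'
require 'thread'

client = HTTPClient.new
queue = Queue.new
batch_urls.each { |url_info| queue << url_info }

threads = Array.new(20) do
  Thread.new do
    while (url_info = (queue.pop(true) rescue nil))
      File.open(url_info[:file], "wb") do |file|
        # get_content yields the body in chunks when given a block
        client.get_content(url_info[:url]) { |chunk| file.write(chunk) }
      end
    end
  end
end
threads.each(&:join)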
First let me say that I know almost nothing about Ruby.
What I do know is that Ruby is an interpreted language; it's not surprising that it's slower than heavily optimised code that's been compiled for a specific platform. Every file operation will probably have checks around it that curl doesn't. The "some other stuff" will slow things down even more. Have you tried profiling your code to see where most of the time is being spent?
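For instance, even a crude split with the standard Benchmark module would show whether the time goes to the network or to the local work (this reuses the curl handle and batch_urls from the question):

require 'benchmark'

network = local = 0.0
batch_urls.each do |url_info|
  network += Benchmark.realtime do
    curl.url = url_info[:url]
    curl.perform
  end
  local += Benchmark.realtime do
    File.open(url_info[:file], "wb") { |f| f << curl.body_str }
    # ... some other stuff
  end
end
puts "network: #{network}s, local work: #{local}s"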
Stiivi, any chance that Net::HTTP would suffice for simple downloading of HTML pages?
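For reference, a minimal Net::HTTP version with one keep-alive connection per batch (a sketch; it assumes all URLs in a batch share a host, as sequential-ID URLs usually do):

require 'net/http'
require 'uri'

first = URI.parse(batch_urls.first[:url])
# Net::HTTP.start keeps one TCP connection open for the whole block
Net::HTTP.start(first.host, first.port) do |http|
  batch_urls.each do |url_info|
    response = http.get(URI.parse(url_info[:url]).request_uri)
    File.open(url_info[:file], "wb") { |f| f.write(response.body) }
  end
end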
You didn't specify a Ruby version, but threads in 1.8.x are user-space threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, interfering with concurrency. Since you're trying to maximize network throughput, you're probably underutilizing CPUs.
Spawn as many processes as the machine has memory for, and limit the reliance on threads.
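A sketch of that shape on a Unix system (the worker count of 8 is an assumption to size against memory; batch_urls and the curb calls are from the question):

require 'curb'

WORKERS = 8
batch_urls.each_slice((batch_urls.size / WORKERS.to_f).ceil) do |slice|
  fork do
    # A separate process means a separate interpreter: no shared GIL,
    # so the workers really do run in parallel across cores.
    curl = Curl::Easy.new
    slice.each do |url_info|
      curl.url = url_info[:url]
      curl.perform
      File.open(url_info[:file], "wb") { |f| f << curl.body_str }
    end
  end
end
Process.waitall  # wait for every worker to finish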