Faster way to download multiple files in R
I wrote a small downloader in R to fetch some log files from a remote server:
# fun_to_list_URLs() and fun_to_gen_local_paths() are placeholders that return
# the vector of remote URLs and the matching vector of local file paths
file_remote <- fun_to_list_URLs()
file_local <- fun_to_gen_local_paths()
credentials <- "usr/pwd"

downloader <- function(file_remote, file_local, credentials) {
  data_bin <- RCurl::getBinaryURL(
    file_remote,
    userpwd = credentials,
    ftp.use.epsv = FALSE,
    forbid.reuse = TRUE
  )
  writeBin(data_bin, file_local)
}
purrr::walk2(
  file_remote,
  file_local,
  ~ downloader(
    file_remote = .x,
    file_local = .y,
    credentials = credentials
  )
)
This works, but slowly compared to an FTP client such as WinSCP: downloading 64 log files of 2 KB each takes minutes.
Is there a faster way to download a large number of files in R?
1 Answer
The curl package has a way to perform async requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files this should give you a large boost in performance. Here is a barebone function that does that (since version 5.0.0, the curl package has a native version of this function, also called multi_download):
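A minimal sketch of what such a function could look like, assuming curl's multi interface (new_pool(), curl_fetch_multi(), multi_run()); the wrapper name async_downloader is illustrative, save_download and multi_run are the names the answer refers to below, and the userpwd option mirrors the credentials from the question:

library(curl)
library(purrr)

async_downloader <- function(file_remote, file_local, credentials = NULL) {
  pool <- curl::new_pool()
  purrr::walk2(file_remote, file_local, function(url, path) {
    handle <- curl::new_handle()
    if (!is.null(credentials)) curl::handle_setopt(handle, userpwd = credentials)
    # save_download is called once a request finishes and writes the raw
    # response body to the local path
    save_download <- function(res) writeBin(res$content, path)
    curl::curl_fetch_multi(url, done = save_download, pool = pool, handle = handle)
  })
  # runs all queued downloads concurrently; returns how many requests
  # succeeded, errored, or are still pending
  curl::multi_run(pool = pool)
}

With curl >= 5.0.0 the packaged equivalent should be a one-liner along the lines of:

curl::multi_download(file_remote, file_local, userpwd = credentials)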
Now we need some test files to compare it to your baseline approach. I use COVID data from the Johns Hopkins University GitHub page, as it contains many small CSV files which should be similar to your files.
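One way to build such a file list is through the GitHub contents API (the exact repository path and the use of jsonlite here are my assumptions, not part of the original answer):

library(jsonlite)

# list the daily-report CSVs of the JHU CSSE COVID-19 repository
# (unauthenticated GitHub API calls are rate limited)
contents <- jsonlite::fromJSON(paste0(
  "https://api.github.com/repos/CSSEGISandData/COVID-19/",
  "contents/csse_covid_19_data/csse_covid_19_daily_reports"
))
file_remote <- contents$download_url[grepl("\\.csv$", contents$name)]
file_local  <- file.path(tempdir(), basename(file_remote))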
We could also infer the file names from the URLs, but I assume that is not what you want. So now let's compare the approaches for these 821 files:
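A rough way to run the comparison (illustrative only; async_downloader is the sketch from above, the second call is the sequential RCurl loop from the question, and no credentials are needed for these public files):

dir_async    <- file.path(tempdir(), "async")
dir_baseline <- file.path(tempdir(), "baseline")
dir.create(dir_async)
dir.create(dir_baseline)

# asynchronous download of all files at once
system.time(
  async_downloader(file_remote, file.path(dir_async, basename(file_remote)))
)

# sequential download, one file after the other
system.time(
  purrr::walk2(
    file_remote,
    file.path(dir_baseline, basename(file_remote)),
    ~ writeBin(RCurl::getBinaryURL(.x), .y)
  )
)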
The new approach is 13.3 times faster than the original one. I would assume that the difference gets bigger the more files you have. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.
The function should also be improved in terms of error handling (currently you get a message saying how many requests succeeded and how many errored, but no indication of which files these were). My understanding is also that multi_run writes files to memory before save_download writes them to disk. With small files this is fine, but it might be an issue with larger ones.

Baseline function:
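Presumably this is just the downloader from the question wrapped into a single function so it can be timed; a sketch:

baseline_downloader <- function(file_remote, file_local, credentials) {
  downloader <- function(url, path) {
    data_bin <- RCurl::getBinaryURL(
      url,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, path)
  }
  # download the files one after another
  purrr::walk2(file_remote, file_local, downloader)
}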
Created on 2022-06-05 by the reprex package (v2.0.1)