A faster way to download multiple files in R

I wrote a small downloader in R to fetch some log files from a remote server in one run:

file_remote <- fun_to_list_URLs()
file_local <- fun_to_gen_local_paths()
credentials <- "usr/pwd"

downloader <- function(file_remote, file_local, credentials) {
  data_bin <- RCurl::getBinaryURL(
    file_remote,
    userpwd = credentials,
    ftp.use.epsv = FALSE,
    forbid.reuse = TRUE
  )
  
  writeBin(data_bin, file_local)
}
  
purrr::walk2(
  file_remote,
  file_local,
  ~ downloader(
    file_remote = .x,
    file_local = .y,
    credentials = credentials
  )
)

This works, but slowly, especially compared to an FTP client such as WinSCP: downloading 64 log files of 2 kB each takes minutes.

Is there a faster way to download a lot of files in R?


1 Answer

朦胧时间 (answered 2025-02-08 04:25:25)

The curl package has a way to perform async requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files this should give you a large boost in performance. Here is a bare-bones function that does that (since version 5.0.0, the curl package has a native version of this function, also called multi_download):

# total_con: max total concurrent connections.
# host_con: max concurrent connections per host.
# print: print status of requests at the end.
multi_download <- function(file_remote, 
                           file_local,
                           total_con = 1000L, 
                           host_con  = 1000L,
                           print = TRUE) {
  
  # check for duplication (deactivated for testing)
  # dups <- duplicated(file_remote) | duplicated(file_local)
  # file_remote <- file_remote[!dups]
  # file_local <- file_local[!dups]
  
  # create pool
  pool <- curl::new_pool(total_con = total_con,
                         host_con = host_con)
  
  # function performed on successful request
  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }
  
  # setup async calls
  invisible(
    lapply(
      file_remote, function(f) 
        curl::curl_fetch_multi(f, done = save_download, pool = pool)
    )
  )
  
  # all created requests are performed here
  out <- curl::multi_run(pool = pool)
  
  if (print) print(out)
  
}
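
Note that, unlike your RCurl call, the sketch above does not send any credentials. For a password-protected FTP server, each request would need its own configured handle. Roughly like this (an untested sketch; it reuses the userpwd string and the FTP options from your question as libcurl handle options):

# Hypothetical helper: one handle per request, carrying the credentials
# and FTP options from the question.
fetch_with_credentials <- function(url, done, pool, credentials) {
  handle <- curl::new_handle(
    userpwd      = credentials,  # e.g. the "usr/pwd" placeholder from above
    ftp_use_epsv = FALSE,        # libcurl option CURLOPT_FTP_USE_EPSV
    forbid_reuse = TRUE          # libcurl option CURLOPT_FORBID_REUSE
  )
  curl::curl_fetch_multi(url, done = done, pool = pool, handle = handle)
}

# Inside multi_download() you would then swap the curl_fetch_multi() call for:
# lapply(file_remote, function(f)
#   fetch_with_credentials(f, done = save_download, pool = pool,
#                          credentials = credentials))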

Now we need some test files to compare it to your baseline approach. I use COVID data from the Johns Hopkins University GitHub page, as it contains many small CSV files, which should be similar to your files.

file_remote <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
  format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%d-%m-%Y"),
  ".csv"
)
file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")

We could also infer the file names from the URLs, but I assume that is not what you want. So now let's compare the approaches for these 821 files:

res <- bench::mark(
  baseline(),
  multi_download(file_remote, 
                 file_local,
                 print = FALSE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
summary(res)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                min median `itr/sec`
#>   <bch:expr>                                             <bch:> <bch:>     <dbl>
#> 1 baseline()                                               2.8m   2.8m   0.00595
#> 2 multi_download(file_remote, file_local, print = FALSE)  12.7s  12.7s   0.0789 
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                               min median `itr/sec`
#>   <bch:expr>                                             <dbl>  <dbl>     <dbl>
#> 1 baseline()                                              13.3   13.3       1  
#> 2 multi_download(file_remote, file_local, print = FALSE)   1      1        13.3
#> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>

The new approach is 13.3 times faster than the original one. I would assume that the difference grows the more files you have. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.

The function should also be improved in terms of error handling (currently you get a message about how many requests succeeded and how many failed, but no indication of which files those were). My understanding is also that multi_run holds the downloaded files in memory before save_download writes them to disk. With small files this is fine, but it might be an issue with larger ones.
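
For completeness: since curl 5.0.0 the built-in curl::multi_download() mentioned at the top handles most of this and returns a status table with one row per file. A minimal sketch of a call (treat the exact arguments, in particular passing userpwd through ..., as an assumption to check against your installed curl version):

# Sketch with the native function (requires curl >= 5.0.0); extra arguments
# such as userpwd should be passed on to the underlying handles.
results <- curl::multi_download(
  urls      = file_remote,
  destfiles = file_local,
  userpwd   = credentials  # only needed for the authenticated FTP case
)
# 'results' should contain one row per file with its download status.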

Baseline function:

baseline <- function() {
  credentials <- "usr/pwd"
  downloader <- function(file_remote, file_local, credentials) {
    data_bin <- RCurl::getBinaryURL(
      file_remote,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, file_local)
  }
  
  purrr::walk2(
    file_remote,
    file_local,
    ~ downloader(
      file_remote = .x,
      file_local = .y,
      credentials = credentials
    )
  )
}

Created on 2022-06-05 by the reprex package (v2.0.1)
