Read all .tar.gz files from a web page

Posted on 2025-02-08 23:38:41

I need to read all the data from this webpage into an R dataframe. I have the code to read the first file:

fn <- "https://www.ncei.noaa.gov/data/global-hourly/archive/csv/1901.tar.gz"
download.file(fn,destfile="tmp.tar.gz")
file_names <- untar("tmp.tar.gz",list=TRUE) 

ISDGlobalHourlyData <- do.call(dplyr::bind_rows,lapply(file_names,read.csv))

but it's too time-consuming to do this individually for each one. Is there a way to read them all with one function?

浅浅 2025-02-15 23:38:41

Some of the files in https://www.ncei.noaa.gov/data/global-hourly/archive/csv/ are quite large (>4 GB), so you may need to increase the number of seconds before the connection times out. It's probably also a good idea to delete tmp.tar.gz and the extracted csv files after they've been loaded into R, but that's up to you.
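As a rough illustration of the timeout and cleanup advice, here is a minimal sketch. The variable names `old_opts` and `archive_path` are made up for this example, and the download step is left as a comment so the snippet runs without network access. R's download timeout is typically 60 seconds unless it has been changed.

```r
# Raise the download timeout for this session and remember the old settings.
# options() returns the previous values, so they can be restored later.
old_opts <- options(timeout = 3600)  # seconds

# Use a throwaway path instead of a fixed file name in the working directory.
archive_path <- tempfile(fileext = ".tar.gz")

# ... download.file(url, destfile = archive_path) would go here ...

# Clean up the archive; unlink() is harmless if the file was never created.
unlink(archive_path)

# Restore the previous timeout once the downloads are done.
options(old_opts)
```

Restoring the old options is optional in a one-off script, but it avoids leaving a one-hour timeout in place for the rest of the session.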

Also, before you run this, note that the dir_ls() call picks up every .csv file in your working directory named 'digits'.csv (e.g. "02960099999.csv"). If you have other .csv files with the same naming scheme, you may have to alter the regex accordingly (e.g. to match 'x number of digits'.csv).
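To make the regex caveat concrete, the sketch below compares an unanchored pattern with an anchored one using base R's grepl(). The file names are invented for the example, and `[0-9]` stands in for `\\d`. Since dir_ls() filters the paths it returns, an anchored pattern may need adjusting if those paths carry a directory prefix.

```r
# Hypothetical file names, used only to illustrate the matching behaviour.
candidates <- c("02960099999.csv", "notes2021.csv",
                "02960099999.csv.bak", "summary.csv")

# Unanchored: matches any name that merely contains digits followed by ".csv".
loose <- grepl("[0-9]+\\.csv", candidates)

# Anchored: the whole name must be digits followed by ".csv" and nothing else.
strict <- grepl("^[0-9]+\\.csv$", candidates)

candidates[loose]   # three of the four names match
candidates[strict]  # only "02960099999.csv" matches
```

If unrelated files in the working directory could match the loose pattern, the anchored form (or extracting into a separate directory) reduces the risk of reading or deleting the wrong files.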

Here is one approach that might be suitable:

library(tidyverse)
library(fs)

options(timeout = 3600)

# For this demo, the 'dates_of_interest' are the years 1901-1906
# For the 'real' command, change this to "1901:2022" (or whatever)
dates_of_interest <- 1901:1906

ISDGlobalHourlyData <- list()

for (i in seq_along(dates_of_interest)) {
  fn <- paste0("https://www.ncei.noaa.gov/data/global-hourly/archive/csv/",
               dates_of_interest[i],
               ".tar.gz")
  download.file(fn, destfile = "tmp.tar.gz")
  untar("tmp.tar.gz")
  filelist <- dir_ls(regexp = "\\d+\\.csv")
  ISDGlobalHourlyData[[i]] <- do.call(dplyr::bind_rows,
                                      lapply(filelist, read.csv))
  # If you want to keep the files, delete these two lines:
  file.remove("tmp.tar.gz")
  file.remove(filelist)
}

names(ISDGlobalHourlyData) <- dates_of_interest
combined_ISDGlobalHourlyData <- bind_rows(ISDGlobalHourlyData, .id = "Year")

str(combined_ISDGlobalHourlyData)
#> 'data.frame':    38301 obs. of  20 variables:
#>  $ Year           : chr  "1901" "1901" "1901" "1901" ...
#>  $ STATION        : num  2.91e+09 2.91e+09 2.91e+09 2.91e+09 2.91e+09 ...
#>  $ DATE           : chr  "1901-01-01T06:00:00" "1901-01-01T13:00:00" "1901-01-01T20:00:00" "1901-01-02T06:00:00" ...
#>  $ SOURCE         : int  4 4 4 4 4 4 4 4 4 4 ...
#>  $ LATITUDE       : num  64.3 64.3 64.3 64.3 64.3 ...
#>  $ LONGITUDE      : num  23.4 23.4 23.4 23.4 23.4 ...
#>  $ ELEVATION      : num  5 5 5 5 5 5 5 5 5 5 ...
#>  $ NAME           : chr  "KALAJOKI ULKOKALLA, FI" "KALAJOKI ULKOKALLA, FI" "KALAJOKI ULKOKALLA, FI" "KALAJOKI ULKOKALLA, FI" ...
#>  $ REPORT_TYPE    : chr  "FM-12" "FM-12" "FM-12" "FM-12" ...
#>  $ CALL_SIGN      : int  99999 99999 99999 99999 99999 99999 99999 99999 99999 99999 ...
#>  $ QUALITY_CONTROL: chr  "V020" "V020" "V020" "V020" ...
#>  $ WND            : chr  "270,1,N,0159,1" "290,1,N,0082,1" "999,1,C,0000,1" "180,1,N,0082,1" ...
#>  $ CIG            : chr  "99999,9,9,N" "99999,9,9,N" "99999,9,9,N" "99999,9,9,N" ...
#>  $ VIS            : chr  "000000,1,N,9" "000000,1,N,9" "000000,1,N,9" "000000,1,N,9" ...
#>  $ TMP            : chr  "-0078,1" "-0072,1" "-0094,1" "-0061,1" ...
#>  $ DEW            : chr  "+9999,9" "+9999,9" "+9999,9" "+9999,9" ...
#>  $ SLP            : chr  "10200,1" "10200,1" "10200,1" "10183,1" ...
#>  $ GF1            : chr  "08,99,1,99,9,99,9,99999,9,99,9,99,9" "04,99,1,99,9,99,9,99999,9,99,9,99,9" "08,99,1,99,9,99,9,99999,9,99,9,99,9" "08,99,1,99,9,99,9,99999,9,99,9,99,9" ...
#>  $ MW1            : chr  "" "" "" "" ...
#>  $ EQD            : chr  NA NA NA NA ...
head(combined_ISDGlobalHourlyData)
#>   Year    STATION                DATE SOURCE LATITUDE LONGITUDE ELEVATION
#> 1 1901 2907099999 1901-01-01T06:00:00      4 64.33333     23.45         5
#> 2 1901 2907099999 1901-01-01T13:00:00      4 64.33333     23.45         5
#> 3 1901 2907099999 1901-01-01T20:00:00      4 64.33333     23.45         5
#> 4 1901 2907099999 1901-01-02T06:00:00      4 64.33333     23.45         5
#> 5 1901 2907099999 1901-01-02T13:00:00      4 64.33333     23.45         5
#> 6 1901 2907099999 1901-01-02T20:00:00      4 64.33333     23.45         5
#>                     NAME REPORT_TYPE CALL_SIGN QUALITY_CONTROL            WND
#> 1 KALAJOKI ULKOKALLA, FI       FM-12     99999            V020 270,1,N,0159,1
#> 2 KALAJOKI ULKOKALLA, FI       FM-12     99999            V020 290,1,N,0082,1
#> 3 KALAJOKI ULKOKALLA, FI       FM-12     99999            V020 999,1,C,0000,1
#> 4 KALAJOKI ULKOKALLA, FI       FM-12     99999            V020 180,1,N,0082,1
#> 5 KALAJOKI ULKOKALLA, FI       FM-12     99999            V020 180,1,N,0098,1
#> 6 KALAJOKI ULKOKALLA, FI       FM-12     99999            V020 180,1,N,0098,1
#>           CIG          VIS     TMP     DEW     SLP
#> 1 99999,9,9,N 000000,1,N,9 -0078,1 +9999,9 10200,1
#> 2 99999,9,9,N 000000,1,N,9 -0072,1 +9999,9 10200,1
#> 3 99999,9,9,N 000000,1,N,9 -0094,1 +9999,9 10200,1
#> 4 99999,9,9,N 000000,1,N,9 -0061,1 +9999,9 10183,1
#> 5 99999,9,9,N 000000,1,N,9 -0056,1 +9999,9 10176,1
#> 6 99999,9,9,N 000000,1,N,9 -0028,1 +9999,9 10175,1
#>                                   GF1 MW1  EQD
#> 1 08,99,1,99,9,99,9,99999,9,99,9,99,9     <NA>
#> 2 04,99,1,99,9,99,9,99999,9,99,9,99,9     <NA>
#> 3 08,99,1,99,9,99,9,99999,9,99,9,99,9     <NA>
#> 4 08,99,1,99,9,99,9,99999,9,99,9,99,9     <NA>
#> 5 08,99,1,99,9,99,9,99999,9,99,9,99,9     <NA>
#> 6 08,99,1,99,9,99,9,99999,9,99,9,99,9     <NA>

Created on 2022-06-21 by the reprex package (v2.0.1)
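Since the question asks for a single function, the loop can also be wrapped up along the following lines. This is only a sketch under some assumptions: the helper names `read_csv_archive`, `read_isd_year` and `read_isd_years` are hypothetical, each archive is extracted into its own temporary directory instead of the working directory (so no file-name regex is needed), and base R's list.files() replaces dir_ls().

```r
library(dplyr)

# Hypothetical helper: read every CSV inside one local .tar.gz archive.
read_csv_archive <- function(archive_path) {
  # Extract into a fresh temporary directory so nothing else can match.
  exdir <- tempfile("isd_")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  untar(archive_path, exdir = exdir)
  csv_files <- list.files(exdir, pattern = "\\.csv$",
                          full.names = TRUE, recursive = TRUE)
  bind_rows(lapply(csv_files, read.csv))
}

# Hypothetical helper: download one year's archive, read it, then delete it.
read_isd_year <- function(year,
                          base_url = "https://www.ncei.noaa.gov/data/global-hourly/archive/csv/") {
  archive_path <- tempfile(fileext = ".tar.gz")
  on.exit(unlink(archive_path), add = TRUE)
  download.file(paste0(base_url, year, ".tar.gz"),
                destfile = archive_path, mode = "wb")
  read_csv_archive(archive_path)
}

# Hypothetical wrapper: one call for a whole range of years.
read_isd_years <- function(years) {
  per_year <- lapply(years, read_isd_year)
  names(per_year) <- years
  bind_rows(per_year, .id = "Year")
}

# Example call (downloads six archives, so it is left commented out):
# combined <- read_isd_years(1901:1906)
```

One caveat applies to this sketch and to the loop alike: bind_rows() stops with an error if the same column is read with different types in different files. Passing colClasses = "character" to read.csv() and converting columns afterwards is one possible way around that.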
