How to download several datasets from an index of data stored by month/year

Posted on 2025-02-13 15:00:41

I need to download climatic datasets at monthly resolutions and several years. The data is available here: https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/

I can download individual files by clicking on them and saving them. But how can I download several datasets (e.g., how to filter for specific years?), or simply download all of the files within a directory? I am sure there should be an automatic way using an FTP connection or some R code (in RStudio), but I can't find any relevant suggestions. I am a Windows 10 user. Where should I start?

Answers (2)

花落人断肠 2025-02-20 15:00:41


Try this:

library(rvest)
baseurl <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/"
res <- read_html(baseurl)
urls1 <- html_nodes(res, "a") %>%
  html_attr("href") %>%
  Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
  paste0(baseurl, .)

This gets us the first level,

urls1
#  [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/"                                                      
#  [2] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/02_Feb/"                                                      
#  [3] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/03_Mar/"                                                      
#  [4] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/04_Apr/"                                                      
#  [5] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/05_May/"                                                      
#  [6] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/06_Jun/"                                                      
#  [7] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/07_Jul/"                                                      
#  [8] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/08_Aug/"                                                      
#  [9] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/09_Sep/"                                                      
# [10] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/10_Oct/"                                                      
# [11] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/11_Nov/"                                                      
# [12] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/12_Dec/"                                                      
# [13] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/BESCHREIBUNG_gridsgermany_monthly_air_temperature_mean_de.pdf"
# [14] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/DESCRIPTION_gridsgermany_monthly_air_temperature_mean_en.pdf" 

As you can see, some are files, some are directories. We can iterate over these URLs to do the same thing:

urls2 <- lapply(grep("/$", urls1, value = TRUE), function(url) {
  res2 <- read_html(url)
  html_nodes(res2, "a") %>%
    html_attr("href") %>%
    Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
    paste0(url, .)
})

Each of those folders contains 141-142 different files:

lengths(urls2)
#  [1] 142 142 142 142 142 142 141 141 141 141 141 141

### confirm no more directories
sapply(urls2, function(z) any(grepl("/$", z)))
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

(This would not be difficult to transform into a recursive search instead of a fixed two-level-deep search.)
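One way to sketch that recursive variant (untested against the live server; `crawl_dir` is a hypothetical helper name, and it assumes every listing page has the same anchor structure as the top level):

```r
library(rvest)

# Descend into any href ending in "/" and collect file URLs at every
# depth, instead of assuming the tree is exactly two levels deep.
crawl_dir <- function(url) {
  hrefs <- read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
    paste0(url, .)
  dirs <- grep("/$", hrefs, value = TRUE)
  files <- grep("/$", hrefs, value = TRUE, invert = TRUE)
  c(files, unlist(lapply(dirs, crawl_dir)))
}

# allurls <- crawl_dir(baseurl)
```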

These files can all be combined with the entries of urls1 that were files (the two .pdf files):

allurls <- c(grep("/$", urls1, value = TRUE, invert = TRUE), unlist(urls2))
length(allurls)
# [1] 1700

head(allurls)
# [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/BESCHREIBUNG_gridsgermany_monthly_air_temperature_mean_de.pdf"
# [2] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/DESCRIPTION_gridsgermany_monthly_air_temperature_mean_en.pdf" 
# [3] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188101.asc.gz"     
# [4] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188201.asc.gz"     
# [5] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188301.asc.gz"     
# [6] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188401.asc.gz"     

And now you can filter as desired and download those that are needed:

needthese <- allurls[c(3,5)]
ign <- mapply(download.file, needthese, basename(needthese))
# trying URL 'https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188101.asc.gz'
# Content type 'application/octet-stream' length 221215 bytes (216 KB)
# downloaded 216 KB
# trying URL 'https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188301.asc.gz'
# Content type 'application/octet-stream' length 217413 bytes (212 KB)
# downloaded 212 KB
file.info(list.files(pattern = "gz$"))
#                                                     size isdir mode               mtime               ctime               atime exe
# grids_germany_monthly_air_temp_mean_188101.asc.gz 221215 FALSE  666 2022-07-06 09:17:21 2022-07-06 09:17:19 2022-07-06 09:17:52  no
# grids_germany_monthly_air_temp_mean_188301.asc.gz 217413 FALSE  666 2022-07-06 09:17:22 2022-07-06 09:17:21 2022-07-06 09:17:52  no
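To address the original question of filtering for specific years: the filenames end in YYYYMM (e.g. grids_germany_monthly_air_temp_mean_188101.asc.gz), so a pattern over allurls can select them. A minimal sketch, where the chosen years are purely illustrative:

```r
# Assuming `allurls` from above. Filenames end in YYYYMM.asc.gz, so a
# year filter can match on that suffix; the years here are illustrative.
years <- 1990:1995
pattern <- sprintf("_(%s)[01][0-9]\\.asc\\.gz$", paste(years, collapse = "|"))
needthese <- grep(pattern, allurls, value = TRUE)

# mode = "wb" is stated explicitly for binary files on Windows, though R
# already infers it for .gz destinations
for (u in needthese) download.file(u, basename(u), mode = "wb")
```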
似狗非友 2025-02-20 15:00:41


You can use the rvest package to scrape the links and use those links to download the files for a specific month in the following way:

library(rvest)
library(stringr)

page_link <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/"

month_name <- "01_Jan"
# you can set month_name as "05_May" to get the data from 05_May

# getting the html page for 01_Jan folder
page <- read_html(paste0(page_link, month_name, "/"))

# getting the link text
link_text <- page %>% 
  html_elements("a") %>% 
  html_text()

# creating links
links <- paste0(page_link, month_name, "/", link_text)[-1]

# extracting the numbers for filename
filenames <- stringr::str_extract(pattern = "\\d+", string = link_text[-1]) 

# creating a directory
dir.create(month_name)

# setting the option for maximizing time limits for downloading
options(timeout = max(600, getOption("timeout")))

# downloading the file
for (i in seq_along(links)) {
  download.file(links[i], paste0(month_name, "/", filenames[i], ".asc.gz"))
}
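If all twelve months are needed, the per-month steps above can be wrapped in a function and applied over the month folders. A sketch that reuses `page_link` from above and assumes every folder follows the "01_Jan" naming pattern seen in the listing (`download_month` is a hypothetical helper, not run against the live server):

```r
library(rvest)

month_names <- sprintf("%02d_%s", 1:12, month.abb)  # "01_Jan", ..., "12_Dec"

download_month <- function(month_name) {
  page <- read_html(paste0(page_link, month_name, "/"))
  link_text <- page %>% html_elements("a") %>% html_text()
  links <- paste0(page_link, month_name, "/", link_text)[-1]  # drop parent-dir link
  dir.create(month_name, showWarnings = FALSE)
  for (i in seq_along(links)) {
    download.file(links[i], file.path(month_name, basename(links[i])))
  }
}

# invisible(lapply(month_names, download_month))
```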
