How to download several datasets from an index of data stored by month/year

Posted on 2025-02-13 15:00:41

I need to download climatic datasets at monthly resolutions and several years. The data is available here: https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/

I can download individual files by clicking on them and saving them. But how can I download several datasets (e.g., how to filter for specific years?), or simply download all of the files within a directory? I am sure there should be an automatic way using an FTP connection or some R code (in RStudio), but I can't find any relevant suggestions. I am a Windows 10 user. Where should I start?

Answers (2)

花落人断肠 2025-02-20 15:00:41


Try this:

library(rvest)
baseurl <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/"
res <- read_html(baseurl)
urls1 <- html_nodes(res, "a") %>%
  html_attr("href") %>%
  Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
  paste0(baseurl, .)

This gets us the first level,

urls1
#  [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/"                                                      
#  [2] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/02_Feb/"                                                      
#  [3] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/03_Mar/"                                                      
#  [4] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/04_Apr/"                                                      
#  [5] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/05_May/"                                                      
#  [6] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/06_Jun/"                                                      
#  [7] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/07_Jul/"                                                      
#  [8] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/08_Aug/"                                                      
#  [9] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/09_Sep/"                                                      
# [10] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/10_Oct/"                                                      
# [11] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/11_Nov/"                                                      
# [12] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/12_Dec/"                                                      
# [13] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/BESCHREIBUNG_gridsgermany_monthly_air_temperature_mean_de.pdf"
# [14] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/DESCRIPTION_gridsgermany_monthly_air_temperature_mean_en.pdf" 

As you can see, some are files, some are directories. We can iterate over these URLs to do the same thing:

urls2 <- lapply(grep("/$", urls1, value = TRUE), function(url) {
  res2 <- read_html(url)
  html_nodes(res2, "a") %>%
    html_attr("href") %>%
    Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
    paste0(url, .)
})

Each of those folders contains 141-142 different files:

lengths(urls2)
#  [1] 142 142 142 142 142 142 141 141 141 141 141 141

### confirm no more directories
sapply(urls2, function(z) any(grepl("/$", z)))
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

(This would not be difficult to transform into a recursive search instead of a fixed two-level-deep search.)
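One way to sketch that recursive variant (untested against the live server; `crawl_dir` is a hypothetical helper name, and it assumes every listing page has the same anchor structure as the top level):

```r
library(rvest)

# Descend into any href ending in "/" and collect file URLs at every
# depth, instead of assuming the tree is exactly two levels deep.
crawl_dir <- function(url) {
  hrefs <- read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
    paste0(url, .)
  dirs <- grep("/$", hrefs, value = TRUE)
  files <- grep("/$", hrefs, value = TRUE, invert = TRUE)
  c(files, unlist(lapply(dirs, crawl_dir)))
}

# allurls <- crawl_dir(baseurl)
```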

These files can all be combined with the entries of urls1 that were files (the two .pdf files):

allurls <- c(grep("/$", urls1, value = TRUE, invert = TRUE), unlist(urls2))
length(allurls)
# [1] 1700

head(allurls)
# [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/BESCHREIBUNG_gridsgermany_monthly_air_temperature_mean_de.pdf"
# [2] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/DESCRIPTION_gridsgermany_monthly_air_temperature_mean_en.pdf" 
# [3] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188101.asc.gz"     
# [4] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188201.asc.gz"     
# [5] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188301.asc.gz"     
# [6] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188401.asc.gz"     

And now you can filter as desired and download those that are needed:

needthese <- allurls[c(3,5)]
ign <- mapply(download.file, needthese, basename(needthese))
# trying URL 'https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188101.asc.gz'
# Content type 'application/octet-stream' length 221215 bytes (216 KB)
# downloaded 216 KB
# trying URL 'https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188301.asc.gz'
# Content type 'application/octet-stream' length 217413 bytes (212 KB)
# downloaded 212 KB
file.info(list.files(pattern = "gz$"))
#                                                     size isdir mode               mtime               ctime               atime exe
# grids_germany_monthly_air_temp_mean_188101.asc.gz 221215 FALSE  666 2022-07-06 09:17:21 2022-07-06 09:17:19 2022-07-06 09:17:52  no
# grids_germany_monthly_air_temp_mean_188301.asc.gz 217413 FALSE  666 2022-07-06 09:17:22 2022-07-06 09:17:21 2022-07-06 09:17:52  no
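To address the original question of filtering for specific years: the filenames end in YYYYMM (e.g. grids_germany_monthly_air_temp_mean_188101.asc.gz), so a pattern over allurls can select them. A minimal sketch, where the chosen years are purely illustrative:

```r
# Assuming `allurls` from above. Filenames end in YYYYMM.asc.gz, so a
# year filter can match on that suffix; the years here are illustrative.
years <- 1990:1995
pattern <- sprintf("_(%s)[01][0-9]\\.asc\\.gz$", paste(years, collapse = "|"))
needthese <- grep(pattern, allurls, value = TRUE)

# mode = "wb" is stated explicitly for binary files on Windows, though R
# already infers it for .gz destinations
for (u in needthese) download.file(u, basename(u), mode = "wb")
```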
似狗非友 2025-02-20 15:00:41


You can use the rvest package to scrape the links and use those links to download the files for a specific month in the following way:

library(rvest)
library(stringr)

page_link <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/"

month_name <- "01_Jan"
# you can set month_name as "05_May" to get the data from 05_May

# getting the html page for 01_Jan folder
page <- read_html(paste0(page_link, month_name, "/"))

# getting the link text
link_text <- page %>% 
  html_elements("a") %>% 
  html_text()

# creating links
links <- paste0(page_link, month_name, "/", link_text)[-1]

# extracting the numbers for filename
filenames <- stringr::str_extract(pattern = "\\d+", string = link_text[-1]) 

# creating a directory
dir.create(month_name)

# setting the option for maximizing time limits for downloading
options(timeout = max(600, getOption("timeout")))

# downloading the file
for (i in seq_along(links)) {
  download.file(links[i], paste0(month_name, "/", filenames[i], ".asc.gz"))
}
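If all twelve months are needed, the per-month steps above can be wrapped in a function and applied over the month folders. A sketch that reuses `page_link` from above and assumes every folder follows the "01_Jan" naming pattern seen in the listing (`download_month` is a hypothetical helper, not run against the live server):

```r
library(rvest)

month_names <- sprintf("%02d_%s", 1:12, month.abb)  # "01_Jan", ..., "12_Dec"

download_month <- function(month_name) {
  page <- read_html(paste0(page_link, month_name, "/"))
  link_text <- page %>% html_elements("a") %>% html_text()
  links <- paste0(page_link, month_name, "/", link_text)[-1]  # drop parent-dir link
  dir.create(month_name, showWarnings = FALSE)
  for (i in seq_along(links)) {
    download.file(links[i], file.path(month_name, basename(links[i])))
  }
}

# invisible(lapply(month_names, download_month))
```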
