Scraping download links from ITU with rvest

Posted on 2025-02-12 23:44:34


I want to get the download links for each of the files on https://datahub.itu.int/indicators/, but I am struggling to get what I need.

Each indicator seems to contain a direct link to download its data in the following format: https://api.datahub.itu.int/v2/data/download/byid/XXX/iscollection/YYY, where XXX is some number between 1 and roughly 100,000+ and YYY is either true or false.
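For concreteness, a minimal sketch of that pattern as a helper function (make_download_url is a made-up name and the ID below is a placeholder, not a real indicator ID):

library(glue) # not needed; base sprintf() is enough

make_download_url <- function(id, is_collection) {
  # id and is_collection are placeholders; real values come from the site
  sprintf(
    "https://api.datahub.itu.int/v2/data/download/byid/%s/iscollection/%s",
    id, tolower(as.character(is_collection))
  )
}

make_download_url(123, FALSE)
#> [1] "https://api.datahub.itu.int/v2/data/download/byid/123/iscollection/false"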

Ideally, I would like the link for each indicator, together with the corresponding name/HTML text of the link, in one big data frame.

I have tried to get the links for the files using rvest with various combinations of html_nodes, html_attrs, and XPaths, but have not had any luck. I really want to avoid looping over and brute-forcing 100,000+ download links, because that would be horribly inefficient and would almost certainly cause problems for their servers.

I am not sure if there is a better way than using rvest, but any help would be most appreciated.

library(rvest)
library(httr)
library(tidyverse)

page <- "https://datahub.itu.int/indicators/"

# Select the anchor elements first, then pull their href attributes;
# html_attr() needs nodes, not the bare document.
read_html(page) %>%
  html_elements("a") %>%
  html_attr("href")
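A quick check along those lines (a sketch, assuming the page structure at the time of writing) hints at why the scrape comes up empty: the static HTML appears to contain no api.datahub.itu.int links at all, so the indicator list is presumably rendered client-side:

library(rvest)

hrefs <- read_html("https://datahub.itu.int/indicators/") %>%
  html_elements("a") %>%
  html_attr("href")

# Likely FALSE: the download-API links are not in the raw HTML
any(grepl("api.datahub.itu.int", hrefs, fixed = TRUE), na.rm = TRUE)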


Comments (1)

很糊涂小朋友 2025-02-19 23:44:34


If you look at the requests the page makes (e.g. in the browser DevTools), you will find a request to an API that retrieves all the links; from this you can build the URLs yourself. (The other solution would be to use RSelenium, but that would be much more complicated.)

library(httr)
library(tidyverse)

# The page fills its indicator list from this endpoint, which returns
# every category with its indicators as nested JSON.
GET("https://api.datahub.itu.int/v2/dictionaries/getcategories") %>%
  content() %>%
  map(as_tibble) %>%
  bind_rows() %>%
  # Flatten the nested subCategory/items structure to one row per indicator.
  unnest_wider(subCategory) %>%
  unnest(items) %>%
  unnest_wider(items) %>%
  # Rebuild each download URL from the indicator ID and collection flag.
  mutate(url = paste0("https://api.datahub.itu.int/v2/data/download/byid/",
                      codeID,
                      "/iscollection/",
                      tolower(as.character(isCollection)))) %>%
  select(category, codeID, label, subCategory, isCollection, url)
#> # A tibble: 181 × 6
#>    category     codeID label                      subCategory isCollection url  
#>    <chr>         <int> <chr>                      <chr>       <lgl>        <chr>
#>  1 Connectivity   8941 Households with a radio    Access      FALSE        http…
#>  2 Connectivity   8965 Households with a TV       Access      FALSE        http…
#>  3 Connectivity 100002 Households with multichan… Access      TRUE         http…
#>  4 Connectivity   8749 Households with telephone… Access      FALSE        http…
#>  5 Connectivity  20719 Individuals who own a mob… Access      FALSE        http…
#>  6 Connectivity  12046 Households with a computer Access      FALSE        http…
#>  7 Connectivity  12047 Households with Internet … Access      FALSE        http…
#>  8 Connectivity 100001 Households with access to… Access      TRUE         http…
#>  9 Connectivity 100000 Reasons for not having In… Access      TRUE         http…
#> 10 Connectivity     15 Fixed-telephone subscript… Access      FALSE        http…
#> # … with 171 more rows

Created on 2022-07-05 by the reprex package (v2.0.1)
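From there, downloading an individual file is just a matter of requesting one of the constructed URLs. A minimal usage sketch, assuming the data frame above was saved as indicators and that the endpoint serves a plain file (the .csv extension is a guess; check the response headers for the real type):

library(httr)

# Fetch the first indicator's file and write it to disk
resp <- GET(indicators$url[[1]],
            write_disk("indicator_1.csv", overwrite = TRUE))
status_code(resp)

# When fetching many files, pause between requests (e.g. Sys.sleep(1))
# to avoid hammering the server.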
