Scraping data download links from ITU using rvest
I want to get the download links for each of the files on the website https://datahub.itu.int/indicators/ but am struggling to get what I need.

Each indicator seems to contain a direct link to download the data in the following format: https://api.datahub.itu.int/v2/data/download/byid/XXX/iscollection/YYY, where XXX is some number between 1 and 100,000+ or so and YYY is either true or false.

Ideally, I would like to get the link to each indicator and the corresponding name/HTML text of the link in one big dataframe.

I have tried to get the links to the files using rvest and various combinations of html_nodes, html_attrs, and XPaths, but have not had any luck. I really want to avoid running a loop and brute-forcing 100,000+ download links, because that is horribly inefficient and would almost certainly cause issues for their servers.

I am not sure if there is a better way than using rvest, but any help would be most appreciated.
library(rvest)
library(httr)
library(tidyverse)

page <- "https://datahub.itu.int/indicators/"

# Grabs every <a> href on the page, but comes back empty for the
# indicators: the links appear to be filled in by JavaScript after the
# page loads, so the static HTML that read_html() fetches lacks them
read_html(page) %>%
  html_elements("a") %>%
  html_attr("href")
1 Answer
If you look at the requests the page makes (e.g. in the browser devtools), you will find that there is a request to an API which retrieves all the links; from this you can build the URLs yourself. (The other solution would be to use RSelenium, but that would be much more complicated.)

Created on 2022-07-05 by the reprex package (v2.0.1)
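As an illustration of that approach, here is a minimal sketch. The listing endpoint (api_url) and the field names in the JSON response (id, name, isCollection) are assumptions, not a documented API; copy the real request from the devtools Network tab and adjust the names to match the actual payload.

library(httr)
library(jsonlite)
library(dplyr)

# Hypothetical listing endpoint -- copy the real one from the request the
# page makes while https://datahub.itu.int/indicators/ loads
api_url <- "https://api.datahub.itu.int/v2/indicators"

resp <- GET(api_url)
stop_for_status(resp)

# Parse the JSON body; the columns used below (id, name, isCollection)
# are assumed names -- rename to whatever the API actually returns
indicators <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

links <- indicators %>%
  transmute(
    name = name,
    url  = sprintf(
      "https://api.datahub.itu.int/v2/data/download/byid/%s/iscollection/%s",
      id, tolower(as.character(isCollection))
    )
  )

head(links)

This makes one request for the whole catalogue instead of probing 100,000+ download URLs, which is exactly the efficiency the question asks for. If the API route does not pan out, a rough sketch of the RSelenium fallback mentioned above (assuming a local Selenium/chromedriver setup) would be:

library(RSelenium)
library(rvest)

# Drive a real browser so the page's JavaScript runs, then scrape the
# rendered DOM -- heavier and slower than calling the API directly
rD <- rsDriver(browser = "chrome", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://datahub.itu.int/indicators/")
Sys.sleep(5)  # crude wait for the JavaScript to populate the links

rendered <- read_html(remDr$getPageSource()[[1]])
hrefs <- rendered %>% html_elements("a") %>% html_attr("href")

remDr$close()
rD$server$stop()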