Download all files in all folders from a URL
I'd like to recursively download all files from nested folders from this URL to my computer in the same nested structure:
https://hazardsdata.geoplatform.gov/?prefix=Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings%20HYDA/
I've tried several different approaches using curl and RCurl, including this and some others. There are multiple file types within this folder, but I keep running into cryptic error messages such as:

Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I'm not even sure how to begin.
1 Answer
In their JavaScript you'll find the URL https://hazards-geoplatform.s3.amazonaws.com/, and there you'll find an XML file containing the paths to (seemingly?) all of their files. From there it shouldn't be hard:

1: Download the XML list of files from https://hazards-geoplatform.s3.amazonaws.com.

2: Each of the XML's <Contents> tags describes a file or a folder. Filter out all the tags that aren't relevant to you; that means if the Contents -> Key tag does not contain the text Brookings HYDA, filter it out.

3: The remaining Contents tags contain your download path and save path. Every Key that ends with / is a "folder"; you can't download a folder, so just create the path. For example, the key Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/ means you should create the folders Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence and move on. However, if the key's value does not end with /, you should download it. For example, the key Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf means the save filepath is that same path, and the URL to download the file is https://hazards-geoplatform.s3.amazonaws.com/ + urlencode(key), in this case: https://hazards-geoplatform.s3.amazonaws.com/Region8%2FR8_MIT%2FRisk_MAP%2FData%2FBLE%2FSouth_Dakota%2F60601300_BrookingsCO%2FBrookings%20HYDA%2FHydraulics_DataCapture%2FCorrespondence%2F200724-CityBrookings-AirportInfo_Email.pdf
I don't know how to do it with curl/R, but here's how to do it in PHP; happy porting.
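A minimal sketch of those three steps, assuming the standard S3 listing format (a <ListBucketResult> root whose <Contents> children each carry a <Key>) and S3's standard prefix/marker query parameters for limiting and paging the listing; neither detail is spelled out above, so treat this as an approximation rather than a tested script:

```php
<?php
// sketch of the three steps above, assuming the standard S3 ListObjects XML:
// <ListBucketResult><Contents><Key>path/to/object</Key>...</Contents>...</ListBucketResult>
$bucket = 'https://hazards-geoplatform.s3.amazonaws.com/';
$prefix = 'Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/';

function fetch(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // buffers the whole response in RAM
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$marker = '';
do {
    // ?prefix= limits the listing to the target folder; S3 returns at most
    // 1000 keys per response, so keep paging with ?marker= until done
    $xml = simplexml_load_string(fetch(
        $bucket . '?' . http_build_query(
            ['prefix' => $prefix, 'marker' => $marker], '', '&', PHP_QUERY_RFC3986
        )
    ));

    foreach ($xml->Contents as $object) {
        $key = (string) $object->Key;
        $marker = $key; // the next page starts after the last key seen

        if (substr($key, -1) === '/') {
            // a key ending in "/" is a folder: just create the local path
            is_dir($key) || mkdir($key, 0777, true);
            continue;
        }

        // anything else is a file: make sure its folder exists, then download it;
        // rawurlencode() produces the %2F / %20 encoding shown in the example URL
        is_dir(dirname($key)) || mkdir(dirname($key), 0777, true);
        file_put_contents($key, fetch($bucket . rawurlencode($key)));
        echo "saved: $key\n";
    }
} while ((string) $xml->IsTruncated === 'true');
```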
After running that for a few seconds, it seems to be working.
Some of their files are over 134 MB in size. It's easy to optimize the curl code to write directly to disk instead of holding the entire file in RAM before writing it out, but since you want to do this in R anyway, I won't bother optimizing the sample PHP script.
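For what it's worth, that straight-to-disk variant is only a few lines with PHP's curl bindings: point CURLOPT_FILE at an open file handle and the body never has to fit in RAM ($bucket and $key as in the sketch above):

```php
// streaming variant of the download step: curl writes the response
// body directly to the open file handle instead of returning it
$fp = fopen($key, 'wb');
$ch = curl_init($bucket . rawurlencode($key));
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_exec($ch);
curl_close($ch);
fclose($fp);
```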