Download all files in all folders from a URL

Posted 2025-01-13 15:57:03


I'd like to recursively download all files from nested folders from this URL to my computer in the same nested structure:

https://hazardsdata.geoplatform.gov/?prefix=Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings%20HYDA/

I've tried several different approaches, using curl and RCurl, including this and some others. There are multiple file types within this folder. But I keep running into cryptic error messages such as Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

I'm not even sure how to begin.


Comments (1)

川水往事 2025-01-20 15:57:03


In their JavaScript you'll find the URL https://hazards-geoplatform.s3.amazonaws.com/, and there you'll find an XML file containing the paths to (seemingly?) all their files. From there it shouldn't be hard, so:

1: Download the XML list of files from https://hazards-geoplatform.s3.amazonaws.com

2: Each of the XML's <Contents> tags describes a file or a folder. Filter out all the tags that are not relevant to you; that means if the Contents->Key tag does not contain the text Brookings HYDA, filter it out.

3: The remaining Contents tags contain your download path and save path. For every Key whose value ends with /: this is a "folder", and you can't download a folder, so just create the path. For example, if the key is

<Contents> 
    <Key>Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/</Key>

this means you should create the folder Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence and move on. However, if the key's value does not end with /, it means you should download it; for example, if you find

<Contents>
    <Key>Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf</Key>
    <LastModified>2022-03-04T17:54:48.000Z</LastModified>
    <ETag>"9fe9af393f043faaa8e368f324c8404a"</ETag>
    <Size>303737</Size>
    <StorageClass>STANDARD</StorageClass>
</Contents>

it means the save filepath is Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf
and the url to download the file is https://hazards-geoplatform.s3.amazonaws.com/ + urlencode(key), in this case:
https://hazards-geoplatform.s3.amazonaws.com/Region8%2FR8_MIT%2FRisk_MAP%2FData%2FBLE%2FSouth_Dakota%2F60601300_BrookingsCO%2FBrookings%20HYDA%2FHydraulics_DataCapture%2FCorrespondence%2F200724-CityBrookings-AirportInfo_Email.pdf
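
Just to make that encoding step concrete, here's a tiny illustrative snippet (my own example, not something the site documents). PHP's rawurlencode() gives the %20-style form shown in the example URL above, while urlencode() (which the full script below uses) turns the space into + instead; both seem to be accepted:

<?php
// illustration only: build the download URL for one key
$base_url = 'https://hazards-geoplatform.s3.amazonaws.com/';
$key = 'Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/'
     . 'Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf';
echo $base_url . rawurlencode($key), "\n"; // space -> %20, slash -> %2F (matches the URL above)
echo $base_url . urlencode($key), "\n";    // space -> +, slash -> %2F (what the script below produces)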

I don't know how to do it with curl/R, but here's how to do it in PHP; happy porting:

<?php
declare(strict_types=1);
function curl_get(string $url): string
{
    echo "fetching {$url}\n";
    static $ch = null;
    if ($ch === null) {
        $ch = curl_init();
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_ENCODING => '',
            CURLOPT_FOLLOWLOCATION=>1,
            CURLOPT_VERBOSE=>0
        ));
    }
    curl_setopt($ch, CURLOPT_URL, $url);
    $ret = curl_exec($ch);
    if(curl_errno($ch)) {
        throw new Exception("curl error ".curl_errno($ch).": ".curl_error($ch));
    }
    return $ret;
}
$base_url = 'https://hazards-geoplatform.s3.amazonaws.com/';
$xml = curl_get($base_url);
$domd = new DOMDocument();
// loadHTML (with warnings suppressed by @) lowercases all tag names,
// which is why the XPath below queries //key rather than //Key
@($domd->loadHTML($xml));
$xp = new DOMXPath($domd);
foreach($xp->query("//key[contains(text(),'Brookings HYDA')]") as $node) {
    $relative = $node->nodeValue;
    if($relative[-1] === '/'){
         // it's a folder, ignore
        continue;
    }
    $dir = dirname($relative);
    if(!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    // note: urlencode() turns spaces into '+', hence the '+' in the output below
    $url = $base_url . urlencode($node->nodeValue);
    file_put_contents($relative, curl_get($url));
}
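
One thing I'd hedge on: fetching the bare bucket URL is a plain S3 ListObjects request, and a stock S3 listing returns at most 1,000 keys per response. If this bucket behaves like that, you'd have to page through the listing with the prefix and marker query parameters, roughly like this sketch (my own addition, reusing the curl_get() function from above):

<?php
// hypothetical pagination sketch: assumes a standard S3 ListObjects endpoint
// and reuses curl_get() from the script above
$base_url = 'https://hazards-geoplatform.s3.amazonaws.com/';
$prefix = 'Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/';
$keys = array();
$marker = '';
do {
    $xml = curl_get($base_url . '?prefix=' . rawurlencode($prefix) . '&marker=' . rawurlencode($marker));
    $domd = new DOMDocument();
    @($domd->loadHTML($xml)); // same lowercasing trick as above
    $xp = new DOMXPath($domd);
    foreach ($xp->query('//key') as $node) {
        $keys[] = $node->nodeValue;
    }
    $truncated = $xp->query('//istruncated')->item(0);
    $more = ($truncated !== null && $truncated->nodeValue === 'true');
    if ($more) {
        $marker = end($keys); // ListObjects v1: last key of this page is the next marker
    }
} while ($more);
echo count($keys), " keys listed\n";

Passing prefix also means only keys under Brookings HYDA come back, so the contains() filter in the XPath wouldn't be needed.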

After running that for a few seconds I have:

$ find
.
./fuk.php
./Region8
./Region8/R8_MIT
./Region8/R8_MIT/Risk_MAP
./Region8/R8_MIT/Risk_MAP/Data
./Region8/R8_MIT/Risk_MAP/Data/BLE
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/2D_Exceptions_2021Update.pdf
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/DCS_Checklist_Hydraulics_BrookingsCoSD.xlsx
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Simulations
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Simulations/RAS
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Simulations/RAS/0.2PAC

So it seems to be working.

  • The last output from the command is:
fetching https://hazards-geoplatform.s3.amazonaws.com/Region8%2FR8_MIT%2FRisk_MAP%2FData%2FBLE%2FSouth_Dakota%2F60601300_BrookingsCO%2FBrookings+HYDA%2FHydraulics_DataCapture%2FSimulations%2FRAS%2F0.2PAC%2FPostProcessing.hdf
PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 65019904 bytes) in /home/hans/test/fuk.php on line 17

meaning at least one of their files is large enough that buffering it in RAM blows through PHP's 128 MB (134217728-byte) memory limit - it's easy to optimize the curl code to write directly to disk instead of storing the entire file in RAM before writing it out, but since you want to do this in R anyway, I won't bother optimizing the sample PHP script.
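
For completeness, here's a minimal, hypothetical sketch of that optimization: a curl_download() helper (my own naming, not part of the script above) that streams each response straight to disk via CURLOPT_FILE instead of buffering it in RAM:

<?php
// hypothetical streaming helper: writes the response body directly to disk
function curl_download(string $url, string $save_path): void
{
    echo "downloading {$url}\n";
    $fp = fopen($save_path, 'wb');
    if ($fp === false) {
        throw new Exception("cannot open {$save_path} for writing");
    }
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_FILE => $fp, // stream straight to the file instead of RAM
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_ENCODING => '',
    ));
    if (curl_exec($ch) === false) {
        throw new Exception("curl error " . curl_errno($ch) . ": " . curl_error($ch));
    }
    curl_close($ch);
    fclose($fp);
}
// usage: in the loop above, replace
//     file_put_contents($relative, curl_get($url));
// with
//     curl_download($url, $relative);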
