wget's download-order logic

Posted 2024-12-10 07:03:36

This is a more general question, but one with wider implications for a data mining project I'm running. I have been using wget to mirror archival webpages for analysis. It's a large amount of data, and my current mirroring process has been going on for almost a week, which has given me a lot of time to watch the readout.

How does wget determine the order in which it downloads pages? I can't seem to discern a consistent logic in its decision-making process (it isn't proceeding alphabetically, by the date of original site creation, or by file type). Understanding this would be very helpful as I begin to work with the data.

FWIW, here is the command I'm using (the site requires cookies, and while its TOS do allow access 'by any means', I don't want to take any chances) - where SITE = URL:

wget -m --cookies=on --keep-session-cookies --load-cookies=cookie3.txt --save-cookies=cookie4.txt --referer=SITE --random-wait --wait=1 --limit-rate=30K --user-agent="Mozilla 4.0" SITE

Edited to add: In the comments on Chown's helpful answer, I refined my question a bit, so here it is. With larger sites - say epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp - I find that wget initially creates a directory structure and some of the index.html/default.html pages, but then goes back through the disparate sites a few more times (grabbing a few more images and sub-pages on each pass, for example).


1 Answer

相守太难 2024-12-17 07:03:36

From gnu.org wget Recursive Download:

  • Recursive Download

GNU Wget is capable of traversing parts of the Web (or a single http
or ftp server), following links and directory structure. We refer to
this as to recursive retrieval, or recursion.

With http urls, Wget retrieves and parses the html or css from the
given url, retrieving the files the document refers to, through markup
like href or src, or css uri values specified using the ‘url()’
functional notation. If the freshly downloaded file is also of type
text/html, application/xhtml+xml, or text/css, it will be parsed and
followed further.

Recursive retrieval of http and html/css content is breadth-first.
This means that Wget first downloads the requested document, then the
documents linked from that document, then the documents linked by
them, and so on. In other words, Wget first downloads the documents at
depth 1, then those at depth 2, and so on until the specified maximum
depth.

The maximum depth to which the retrieval may descend is specified with
the ‘-l’ option. The default maximum depth is five layers.

When retrieving an ftp url recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally. ftp retrieval is also limited by the depth parameter. Unlike
http recursion, ftp recursion is performed depth-first.

By default, Wget will create a local directory tree, corresponding to
the one found on the remote server.

.... snip ....

Recursive retrieval should be used with care. Don't say you were not
warned.
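
To make the two orderings concrete, here is a minimal Python sketch of the scheduling logic the manual describes: a breadth-first queue with a depth cap for HTTP, and a depth-first walk for FTP. This is a toy model, not wget's actual implementation - the link graph and URLs are invented for illustration:

from collections import deque

# Toy link graph standing in for a small site. wget builds this
# incrementally by parsing each downloaded page (href/src/url()).
PAGES = {
    "/index.html": ["/style.css", "/c.htm", "/a.htm", "/b.htm"],
    "/c.htm": ["/c/img1.png"],
    "/a.htm": ["/a/sub.htm"],
    "/a/sub.htm": ["/a/deep.htm"],
}

def http_order(start, max_depth=5):
    """Breadth-first, like wget's HTTP recursion: everything at depth N
    is fetched before anything at depth N+1 (the default -l depth is 5)."""
    seen, order, queue = {start}, [], deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for link in PAGES.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return order

def ftp_order(start, max_depth=5, seen=None):
    """Depth-first, like wget's FTP recursion: each subtree is
    exhausted before its siblings are visited."""
    seen = set() if seen is None else seen
    if start in seen:
        return []
    seen.add(start)
    order = [start]
    if max_depth > 0:
        for link in PAGES.get(start, []):
            order += ftp_order(link, max_depth - 1, seen)
    return order

print(http_order("/index.html"))  # all depth-1 files, then depth 2, ...
print(ftp_order("/index.html"))   # /a.htm's subtree finishes before /b.htm

Within a single depth level the queue is filled in the order links are encountered in the parsed markup, which is the top-to-bottom behaviour demonstrated in the test below. It also suggests why a large mirror appears to "go back" to the same sites several times: each apparent pass corresponds to the next depth level of the queue.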


From my own very basic testing, when the structure depth is 1 it goes in order of appearance, from the top to the bottom of the page:

[ 16:28 root@host /var/www/html ]# cat index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en-US">
    <head>
        <link rel="stylesheet" type="text/css" href="style.css">
    </head>
    <body>
        <div style="text-align:center;">
            <h2>Mobile Test Page</h2>
        </div>
        <a href="/c.htm">c</a>
        <a href="/a.htm">a</a>
        <a href="/b.htm">b</a>
    </body>
</html>



[ 16:28 jon@host ~ ]$ wget -m http://98.164.214.224:8000
--2011-10-15 16:28:51--  http://98.164.214.224:8000/
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 556 [text/html]
Saving to: "98.164.214.224:8000/index.html"

100%[====================================================================================================================================================================================================>] 556         --.-K/s   in 0s

2011-10-15 16:28:51 (19.7 MB/s) - "98.164.214.224:8000/index.html" saved [556/556]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 221 [text/css]
Saving to: "98.164.214.224:8000/style.css"

100%[====================================================================================================================================================================================================>] 221         --.-K/s   in 0s

2011-10-15 16:28:51 (777 KB/s) - "98.164.214.224:8000/style.css" saved [221/221]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/c.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/html]
Saving to: "98.164.214.224:8000/c.htm"

    [ <=>                                                                                                                                                                                                 ] 0           --.-K/s   in 0s

2011-10-15 16:28:51 (0.00 B/s) - "98.164.214.224:8000/c.htm" saved [0/0]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/a.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/a.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (102 KB/s) - "98.164.214.224:8000/a.htm" saved [2/2]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/b.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/b.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (85.8 KB/s) - "98.164.214.224:8000/b.htm" saved [2/2]

FINISHED --2011-10-15 16:28:51--
Downloaded: 5 files, 781 in 0s (2.15 MB/s)
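
The log bears out that within-page ordering: references are fetched in the order they appear in the markup - style.css from the <head> first, then c, a, b in link order. As a sanity check, here is a short standard-library Python sketch that extracts references from a local copy of that test page in document order (the file name index.html is an assumption):

from html.parser import HTMLParser

class RefCollector(HTMLParser):
    """Collects href/src values in the order they appear in the HTML,
    roughly mirroring the order a crawler would queue them."""
    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.refs.append(value)

parser = RefCollector()
with open("index.html", encoding="utf-8") as fh:  # local copy of the test page
    parser.feed(fh.read())
print(parser.refs)  # ['style.css', '/c.htm', '/a.htm', '/b.htm']

The printed order matches the sequence of requests in the wget log above.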