Wget's download order logic
This is a fairly general question, but it has wider implications for a data mining project I'm running. I have been using wget to mirror archival webpages for analysis. This is a large amount of data, and my current mirroring process has been running for almost a week, which has given me a lot of time to watch the readout.
How does wget determine the order in which it downloads pages? I can't seem to discern a consistent logic to its decision-making process (it's not proceeding alphabetically, by date of original site creation, or by file type). As I begin to work with the data, this would be very helpful to understand.
FWIW, here is the command I'm using (the site requires cookies, and while its TOS do allow access 'by any means', I don't want to take any chances), where SITE = URL:
wget -m --cookies=on --keep-session-cookies --load-cookies=cookie3.txt --save-cookies=cookie4.txt --referer=SITE --random-wait --wait=1 --limit-rate=30K --user-agent="Mozilla 4.0" SITE
Edited to add: In comments on Chown's helpful answer, I refined my question a bit, so here it is. With larger sites (say, epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp), I find that wget initially creates a directory structure and some of the index.html/default.html pages, but then goes back through the disparate websites a few more times, grabbing a few more images and sub-pages on each pass.
Comments (1)
From gnu.org wget Recursive Download:
From my own very basic testing, it goes in order of appearance from top to bottom of the page when the structure depth is 1:
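As a separate illustration (not the test output referred to above): the wget manual describes recursive retrieval of HTTP content as breadth-first, meaning the requested document is fetched first, then the documents it links to, then the documents those link to, and so on. The sketch below simulates that ordering on a small made-up link graph. The `SITE` map and the `crawl_order` function are hypothetical names for this example, and it ignores details of real wget behaviour such as page requisites, robots.txt, and retries.

```python
from collections import deque

# Hypothetical site map: each page lists its links in the order
# they appear in the page source (top to bottom).
SITE = {
    "/index.html": ["/b.html", "/a.html", "/img/logo.png"],
    "/b.html": ["/b/deep.html", "/a.html"],
    "/a.html": ["/img/photo.jpg", "/index.html"],
    "/b/deep.html": [],
    "/img/logo.png": [],
    "/img/photo.jpg": [],
}


def crawl_order(site, start, max_depth=None):
    """Return URLs in breadth-first order, following links in order of appearance."""
    seen = {start}
    queue = deque([(start, 0)])  # FIFO queue of (url, depth)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in site.get(url, []):
            if link not in seen:  # each URL is queued only once
                seen.add(link)
                queue.append((link, depth + 1))
    return order


if __name__ == "__main__":
    for url in crawl_order(SITE, "/index.html"):
        print(url)
```

In this model, everything one link away from the start page is fetched before anything two links away, and within a level the order follows link position on the page rather than alphabetical order or file type. That may be why a large mirror appears to create the top-level index pages first and then revisit each directory on later "passes": each pass would correspond to one more level of link depth.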