Problem with wget's -N option
I am trying to scrape a website using wget. Here is my command:
wget -t 3 -N -k -r -x
The -N means "don't re-download a file unless the server's version is newer than the local version". But this isn't working: the same files get downloaded over and over again when I restart the above scraping operation, even though the files haven't changed.
Many of the downloaded pages report:
Last-modified header missing -- time-stamps turned off.
I've tried scraping several websites, but all I've tried so far exhibit this problem.
Is this a situation controlled by the remote server? Are they choosing not to send those timestamp headers? If so, is there much I can do about it?
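I suppose one way to check is to have wget print the response headers without downloading anything; if no Last-Modified line comes back, the server simply isn't sending the timestamp (http://example.com/page.html below just stands in for a page on the real site):

wget --spider --server-response http://example.com/page.html 2>&1 | grep -i 'Last-Modified'

--spider probes the URL without saving it, and --server-response (-S) prints the raw headers; wget writes them to stderr, hence the 2>&1.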
I am aware of the -nc (no-clobber) option, but that will prevent an existing file from being overwritten even if the server file is newer, resulting in stale local data accumulating.
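For comparison, the no-clobber form of my crawl would look like this (again with a placeholder URL); -nc skips any file that already exists locally without ever checking timestamps, which is exactly the staleness problem, and wget won't accept -N and -nc together, so it's one or the other:

wget -t 3 -nc -k -r -x http://example.com/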
Thanks
Drew
Comments (1)
The wget -N switch does work, but a lot of web servers don't send the Last-Modified header, for various reasons. Dynamic pages (PHP, any CMS, etc.), for example, have to actively implement the functionality: figure out when the content was last modified and send the header. Some do, some don't. There really isn't another reliable way to check whether a file has changed, either.
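As a quick sanity check of the switch itself, run the same -N fetch twice against a static file on a server that does send the header (the URL is a placeholder):

wget -N http://example.com/style.css
wget -N http://example.com/style.css

The second run should report something like "Server file no newer than local file ... -- not retrieving" instead of downloading again; if you see the "time-stamps turned off" message instead, the server never sent Last-Modified and -N has nothing to compare against.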