Getting a directory listing over HTTP
There is a directory that is being served over the net which I'm interested in monitoring. Its contents are various versions of software that I'm using, and I'd like to write a script that I could run which checks what's there and downloads anything that is newer than what I've already got.
Is there a way, say with wget or something, to get a directory listing? I've tried using wget on the directory, which gives me HTML. To avoid having to parse the HTML document, is there a way of retrieving a simple listing like ls would give?
Answers (8)
I just figured out a way to do it:
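Along these lines (the URL is a placeholder; --spider makes wget crawl recursively without saving the pages):

# placeholder URL; recursive spider mode visits every link but downloads nothing
$ wget --spider -r --no-parent http://example.com/some/dir/
# wget logs to stderr, so redirect it before grepping for what you want
$ wget --spider -r --no-parent http://example.com/some/dir/ 2>&1 | grep '\.tar\.gz'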
It's quite verbose, so you need to pipe through grep a couple of times depending on what you're after, but the information is all there. It looks like it prints to stderr, so append 2>&1 to let grep at it. I grepped for "\.tar\.gz" to find all of the tarballs the site had to offer.
Note that wget writes temporary files in the working directory, and doesn't clean up its temporary directories. If this is a problem, you can change to a temporary directory:
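For example (a sketch: mktemp -d creates a throwaway directory, and the URL is again a placeholder):

# run wget from a scratch directory so its leftovers don't pollute the current one
$ (cd "$(mktemp -d)" && wget --spider -r --no-parent http://example.com/some/dir/ 2>&1) | grep '\.tar\.gz'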
What you are asking for is best served using FTP, not HTTP.
HTTP has no concept of directory listings; FTP does.
Most HTTP servers do not allow access to directory listings, and those that do are doing so as a feature of the server, not of the HTTP protocol. Those servers choose to generate and send an HTML page for human consumption, not machine consumption. You have no control over that, and would have no choice but to parse the HTML.
FTP is designed for machine consumption, more so with the introduction of the MLST and MLSD commands that replace the ambiguous LIST command.
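As an illustration (hypothetical FTP host; curl's --list-only asks the server for a bare, machine-friendly list of names instead of an HTML page):

# hypothetical server; --list-only (-l) returns just the file names, one per line
$ curl --list-only ftp://ftp.example.com/pub/releases/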
The following is not recursive, but it worked for me:
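For instance, with a placeholder URL (curl just fetches the generated index page):

# placeholder URL; -s/--silent keeps progress noise out of the pipe
$ curl -s https://example.com/some/dir/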
The output is HTML and is written to stdout. Unlike with wget, there is nothing written to disk. -s (--silent) is relevant when piping the output, especially within a script that must not be noisy.
Whenever possible, remember not to use ftp or http instead of https.
If it's being served by http then there's no way to get a simple directory listing. The listing you see when you browse there, which is the one wget is retrieving, is generated by the web server as an HTML page. All you can do is parse that page and extract the information.
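A rough sketch of that parsing step, assuming a placeholder URL and a typical auto-generated index page (the pattern pulls the href targets out of the HTML):

# placeholder URL; extract the link targets from the server-generated index page
$ curl -s http://example.com/some/dir/ | grep -oE 'href="[^"]+"' | cut -d'"' -f2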
AFAIK, there is no way to get a directory listing like that, for security reasons. It is rather lucky that your target directory has the HTML listing, because it allows you to parse it and discover new downloads.
You can use IDM (Internet Download Manager).
It has a utility named "IDM SITE GRABBER": input the http/https URLs and it will download all files and folders from the http/https protocol for you.
elinks does a halfway decent job of this. Just run elinks <URL> to interact with a directory tree through the terminal.
You can also dump the content to the terminal. In that case, you may want flags like --no-references and --no-numbering.
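A sketch of the dump form, using a placeholder URL (option spelling can vary slightly between ELinks versions):

# placeholder URL; dump the rendered listing as plain text, without link footnotes or numbering
$ elinks --dump --no-references --no-numbering http://example.com/some/dir/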
Use lftp:
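A minimal sketch, with a placeholder URL; lftp can open an HTTP URL and browse the generated index as if it were a directory:

# placeholder URL; -c runs the given commands and exits
$ lftp -c "open http://example.com/some/dir/; ls"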