Script to get the HTTP status code of a list of URLs?
I have a list of URLs that I need to check, to see whether they still work or not. I would like to write a bash script that does that for me.
I only need the returned HTTP status code, i.e. 200, 404, 500 and so forth. Nothing more.
EDIT: Note that there is an issue if the page says "404 not found" but returns a 200 OK message. That is a misconfigured web server, but you may have to consider this case.
For more information, see "Check if a URL goes to a page containing the text 404".
9 Answers
Curl has a specific option, --write-out, for this:
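A minimal sketch of the invocation, with http://example.com/ standing in for one of your URLs:

    curl -o /dev/null --silent --head --write-out '%{http_code}\n' http://example.com/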
-o /dev/null  throws away the usual output
--silent  throws away the progress meter
--head  makes a HEAD HTTP request, instead of GET
--write-out '%{http_code}\n'  prints the required status code

To wrap this up in a complete Bash script:
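A sketch of such a script, assuming the URLs are listed one per line in a file named url-list.txt (the filename is an assumption):

    # read each URL and print its status code followed by the URL itself
    while read -r url; do
        curl -o /dev/null --silent --head --write-out "%{http_code} $url\n" "$url"
    done < url-list.txt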
(Eagle-eyed readers will notice that this uses one curl process per URL, which imposes fork and TCP connection penalties. It would be faster if multiple URLs were combined in a single curl, but there isn't space to write out the monstrous repetition of options that curl requires to do this.)
The following prints only the status code for you:
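A sketch along those lines, reusing curl's --write-out option from the first answer (http://example.com/ is a placeholder):

    curl --silent -o /dev/null --write-out '%{http_code}\n' http://example.com/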
Extending the answer already provided by Phil: adding parallelism to it is a no-brainer in bash if you use xargs for the call.
Here is the code:
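A sketch, assuming the URLs are listed one per line in url.lst (the %{url_effective} variable prefixes each status code with the URL it belongs to):

    xargs -n1 -P10 curl -o /dev/null --silent --head \
        --write-out '%{url_effective}: %{http_code}\n' < url.lst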
-n1: use just one value (from the list) as argument to the curl call
-P10: Keep 10 curl processes alive at any time (i.e. 10 parallel connections)
Check the write_out parameter in the manual of curl for more data you can extract using it (times, etc.). In case it helps someone, this is the call I'm currently using:
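A hedged reconstruction of that kind of call, assuming timing and size fields are wanted and results.csv is the output file (both are assumptions):

    xargs -n1 -P10 curl -o /dev/null --silent --head --write-out \
        '%{url_effective};%{http_code};%{time_total};%{time_namelookup};%{time_connect};%{size_download};%{speed_download}\n' \
        < url.lst | tee results.csv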
It just outputs a bunch of data into a csv file that can be imported into any office tool.
This relies on the widely available wget, present almost everywhere, even on Alpine Linux.
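A sketch of the pipeline, with "$url" standing in for the URL to check (awk keeps the second field of the first header line, i.e. the status code):

    wget --server-response --spider --quiet "$url" 2>&1 | awk 'NR==1{print $2}'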
The explanations are as follows:

--quiet  turns off wget's normal output
--spider  behaves like a web spider: it does not download the pages, it just checks that they are there
--server-response  prints the headers sent by the HTTP server
What they don't say about --server-response is that those headers are printed to standard error (stderr), hence the need to redirect them to standard output with 2>&1.

With the headers on standard output, we can pipe them to awk to extract the HTTP status code. That code selects:

the second ($2) non-blank group of characters: $2
on the very first line only: NR==1
and, because we want to print it: {print $2}
Use curl to fetch the HTTP header only (not the whole file) and parse it:
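A sketch of that approach, with http://example.com/ as a placeholder URL (-I asks for the headers only, head -n1 keeps the status line, and cut pulls out the second field, the code):

    curl -I --silent http://example.com/ | head -n1 | cut -d' ' -f2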
wget -S -i *file* will get you the headers from each URL in a file. Filter through grep for the status code specifically.
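One way that filtering might look, assuming the URLs sit in urls.txt (wget -S writes the server responses to stderr, hence the 2>&1; adding --spider avoids downloading the bodies):

    wget -S --spider -i urls.txt 2>&1 | grep "HTTP/"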
I found a tool "webchk" written in Python. It returns a status code for a list of URLs.
https://pypi.org/project/webchk/
The output lists each URL together with its HTTP status code.
Hope that helps!
Keeping in mind that curl is not always available (particularly in containers), there is an issue with the wget pipeline above: it will return an exit status of 0 even if the URL doesn't exist.
Alternatively, here is a reasonable container health-check using wget:
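A sketch of such a health-check; http://localhost:8080/health is an assumed endpoint, and the exit code of grep is what the container runtime sees:

    # exit 0 (healthy) only if the final response headers contain "200 OK"
    wget -S --spider -q --tries=1 "http://localhost:8080/health" 2>&1 | grep "200 OK" > /dev/null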
While it may not give you the exact status code, it will at least give you a valid exit-code-based health response (even with redirects on the endpoint).
Due to https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_with_xargs_-P (output from parallel jobs in xargs risks being mixed), I would use GNU Parallel instead of xargs to parallelize:
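A sketch with GNU Parallel and the same curl options, assuming the URLs are in url.lst and out.txt collects the results (both names are assumptions; -q makes parallel preserve the quoting of the --write-out format string):

    cat url.lst |
      parallel -j10 -q curl -o /dev/null --silent --head \
        --write-out '%{url_effective}: %{http_code}\n' > out.txt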
In this particular case it may be safe to use xargs because the output is so short, so the problem with using xargs is rather that if someone later changes the code to do something bigger, it will no longer be safe. Or if someone reads this question and thinks he can replace curl with something else, then that may also not be safe.

Example url.lst:
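A purely hypothetical url.lst for illustration (any mix of working and broken URLs will do):

    https://example.com/
    https://example.org/this-page-does-not-exist
    https://example.net/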