Script to get the HTTP status codes for a list of URLs?

Posted 2024-11-10 06:31:18

I have a list of URLs that I need to check, to see if they still work or not. I would like to write a bash script that does that for me.

I only need the returned HTTP status code, i.e. 200, 404, 500 and so forth. Nothing more.

EDIT Note that there is an issue if the page says "404 not found" but returns a 200 OK message. It's a misconfigured web server, but you may have to consider this case.

For more on this, see Check if a URL goes to a page containing the text "404"
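
A minimal sketch of handling that soft-404 case (my own addition; the URL and temp file are hypothetical): record the status code with curl's --write-out, keep the body, and flag 200 responses whose body looks like an error page.

url="http://example.com/some/page"   # hypothetical URL
body=$(mktemp)
code=$(curl -s -o "$body" --write-out '%{http_code}' "$url")
if [ "$code" = "200" ] && grep -qi "404 not found" "$body"; then
  code="200 (soft 404?)"   # misconfigured server: 200 OK but an error page body
fi
echo "$code $url"
rm -f "$body"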

Comments (9)

忆梦 2024-11-17 06:31:18

Curl has a specific option, --write-out, for this:

$ curl -o /dev/null --silent --head --write-out '%{http_code}\n' <url>
200
  • -o /dev/null throws away the usual output
  • --silent throws away the progress meter
  • --head makes a HEAD HTTP request, instead of GET
  • --write-out '%{http_code}\n' prints the required status code

To wrap this up in a complete Bash script:

#!/bin/bash
# Read each URL from url-list.txt and print "<status code> <url>".
while read -r LINE; do
  curl -o /dev/null --silent --head --write-out "%{http_code} $LINE\n" "$LINE"
done < url-list.txt

(Eagle-eyed readers will notice that this uses one curl process per URL, which imposes fork and TCP connection penalties. It would be faster if multiple URLs were combined in a single curl invocation, but there isn't space to write out the monstrous repetition of options that curl requires to do this.)
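
For what it's worth, one way around that option repetition (my own sketch, assuming a curl new enough to read a config file from stdin with --config -) is to generate the url/output pairs on the fly, so a single curl process handles every URL:

awk '{printf "url = \"%s\"\noutput = \"/dev/null\"\n", $0}' url-list.txt |
  curl --config - --silent --head --write-out '%{http_code} %{url_effective}\n'

On curl 7.66 or later, adding --parallel to that command should also run the transfers concurrently.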

┈┾☆殇 2024-11-17 06:31:18
wget --spider -S "http://url/to/be/checked" 2>&1 | grep "HTTP/" | awk '{print $2}'

prints only the status code for you

喵星人汪星人 2024-11-17 06:31:18

Extending the answer already provided by Phil: adding parallelism to it is a no-brainer in bash if you use xargs for the call.

Here's the code:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' < url.lst

-n1: use just one value (from the list) as argument to the curl call

-P10: Keep 10 curl processes alive at any time (i.e. 10 parallel connections)

Check the --write-out parameter in the curl manual for more data you can extract with it (times, etc.).

In case it helps someone this is the call I'm currently using:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective};%{http_code};%{time_total};%{time_namelookup};%{time_connect};%{size_download};%{speed_download}\n' < url.lst | tee results.csv

It just outputs a bunch of data into a csv file that can be imported into any office tool.
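
If it helps, here is a small follow-up sketch (my own, not part of the original answer) that tallies how many URLs returned each status code from the semicolon-separated results.csv produced above:

awk -F';' '{count[$2]++} END {for (code in count) print count[code], code}' results.csv | sort -rn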

花之痕靓丽 2024-11-17 06:31:18

This relies on widely available wget, present almost everywhere, even on Alpine Linux.

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'

The explanations are as follows:

--quiet

Turn off Wget's output.

Source - wget man pages

--spider

[ ... ] it will not download the pages, just check that they are there. [ ... ]

Source - wget man pages

--server-response

Print the headers sent by HTTP servers and responses sent by FTP servers.

Source - wget man pages

What they don't say about --server-response is that those headers are printed to standard error (stderr), hence the 2>&1 redirection onto standard output.

With the headers now on standard output, we can pipe them to awk to extract the HTTP status code. That code is:

  • the second ($2) non-blank group of characters: {$2}
  • on the very first line of the header: NR==1

And because we want to print it... {print $2}.

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'
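
One caveat worth adding (my note, not the original answer's): with redirects, --server-response prints one header block per response, so NR==1 gives the status of the first response (e.g. 301), not the final one. A small variant that keeps the last HTTP status line instead:

wget --server-response --spider --quiet "${url}" 2>&1 | awk '/^  HTTP\//{code=$2} END{print code}'
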
不语却知心 2024-11-17 06:31:18

Use curl to fetch the HTTP headers only (not the whole file) and parse them:

$ curl -I  --stderr /dev/null http://www.google.co.uk/index.html | head -1 | cut -d' ' -f2
200
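
To run that over the whole list from the question, a quick loop sketch (urls.txt is a hypothetical filename, one URL per line):

while read -r url; do
  code=$(curl -I --stderr /dev/null "$url" | head -1 | cut -d' ' -f2)
  echo "$code $url"
done < urls.txt
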
七堇年 2024-11-17 06:31:18

wget -S -i *file* will get you the headers from each URL in the file.

Filter through grep for the status code specifically.
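
For example, a sketch of that filtering (my own variant: --spider is added so nothing is actually downloaded, and url-list.txt is a hypothetical filename; wget prints the response headers on stderr, hence the 2>&1):

wget -S --spider -i url-list.txt 2>&1 | grep "HTTP/" | awk '{print $2}'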

潇烟暮雨 2024-11-17 06:31:18

I found a tool "webchk" written in Python. It returns a status code for a list of URLs.
https://pypi.org/project/webchk/

Output looks like this:

▶ webchk -i ./dxieu.txt | grep '200'
http://salesforce-case-status.dxi.eu/login ... 200 OK (0.108)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.389)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.401)

Hope that helps!
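
For reference, installing and running it looks roughly like this (assuming a standard pip setup; urls.txt is your own file of URLs):

pip install webchk
webchk -i urls.txt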

野鹿林 2024-11-17 06:31:18

Keeping in mind that curl is not always available (particularly in containers), there are issues with this solution:

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'

which will return exit status of 0 even if the URL doesn't exist.

Alternatively, here is a reasonable container health-check for using wget:

wget -S --spider -q -t 1 "${url}" 2>&1 | grep "200 OK" > /dev/null

While it may not give you the exact status code, it will at least give you a valid exit-code-based health response (even with redirects on the endpoint).
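
As a sketch of how that reads in a script (the endpoint URL is hypothetical; note the exit status comes from grep, not wget):

url="http://localhost:8080/health"   # hypothetical endpoint
if wget -S --spider -q -t 1 "${url}" 2>&1 | grep "200 OK" > /dev/null; then
  echo "healthy"
else
  echo "unhealthy"
  exit 1
fi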

眼泪淡了忧伤 2024-11-17 06:31:18

Due to https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_with_xargs_-P (output from parallel jobs in xargs risks being mixed), I would use GNU Parallel instead of xargs to parallelize:

cat url.lst |
  parallel -P0 -q curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' > outfile

In this particular case it may be safe to use xargs because the output is so short, so the problem with using xargs is rather that if someone later changes the code to do something bigger, it will no longer be safe. Or if someone reads this question and thinks he can replace curl with something else, then that may also not be safe.

Example url.lst:

https://fsfe.org
https://www.fsf.org/bulletin/2010/fall/gnu-parallel-a-design-for-life
https://www.fsf.org/blogs/community/who-actually-reads-the-code
https://publiccode.eu/
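
A small usage note (my own): if you want the output lines in the same order as url.lst, GNU Parallel's -k / --keep-order flag does that without giving up parallelism:

cat url.lst |
  parallel -k -P0 -q curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' > outfile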