Easiest way to extract URLs from an HTML page using only sed or awk

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.

What is the easiest way to do this?

泪眸﹌ 2024-08-21 09:28:25

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html
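
If you only want the bare URLs from that listing, a small awk filter over lynx's numbered output is enough (a sketch; my.html is whatever file you are processing):

lynx -dump -listonly my.html | awk '$2 ~ /^https?:/ {print $2}'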
俏︾媚 2024-08-21 09:28:25

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

完美的未来在梦里 2024-08-21 09:28:25
grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. 第一个 grep 查找包含 url 的行。您可以添加更多元素
    之后如果你只想查看本地页面,那么就没有http,但是
    相对路径。
  2. 第一个 sed 将在每个 a href url 标签前面添加一个换行符,并带有 \n
  3. 第二个 sed 将在该行中的第二个 " 后面通过将其替换为 /a 来缩短每个 url带有换行符的 标签
    两个 sed 都会在一行上为您提供每个 url,但是存在垃圾,因此
  4. 第二个 grep href 会清理混乱
  5. 排序和 uniq 将为您提供 sourcepage.html 中存在的每个现有 url 的一个实例
grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. The first grep looks for lines containing URLs. You can add more patterns
    after it if you only want local pages (no http, just relative paths).
  2. The first sed adds a newline in front of each <a href URL tag.
  3. The second sed shortens each URL by replacing everything after the 2nd " in the line with a closing </a> tag followed by a newline.
    Both seds give you each URL on its own line, but with some garbage left over, so
  4. the second grep href cleans up the mess, and
  5. sort and uniq give you one instance of each URL present in sourcepage.html.
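
For comparison, a compact sed-only variant (just a sketch: it assumes each href value is wrapped in double quotes and prints at most one URL per input line):

sed -n 's/.*<a [^>]*href="\([^"]*\)".*/\1/p' sourcepage.html | sort -u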
oО清风挽发oО 2024-08-21 09:28:25

With the Xidel - HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
硪扪都還晓 2024-08-21 09:28:25

I made a few changes to Greg Bacon's solution:

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. We now match cases where href is not the anchor's first attribute
  2. We cover the possibility of several anchors on the same line
短暂陪伴 2024-08-21 09:28:25

An example, since you didn't provide any sample input:

awk 'BEGIN{
RS="</a>"      # one record per anchor element (a multi-character RS is a gawk extension)
IGNORECASE=1   # case-insensitive matching (also gawk-specific)
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip everything up to and including href=" (\042 is the double quote)
      gsub(/\042.*/,"",$o)        # strip from the closing quote to the end
      print $(o)
    }
  }
}' index.html
得不到的就毁灭 2024-08-21 09:28:25

You can do it quite easily with the following regex, which is quite good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows (the pattern uses Perl-style constructs such as (?: ... ) and \w, so GNU grep needs -P rather than -E):

cat f.html | grep -o \
    -P '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
刘备忘录 2024-08-21 09:28:25

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!

谁的新欢旧爱 2024-08-21 09:28:25

This is my first post, so I will try to do my best to explain why I am posting this answer...

  1. Of the first 7 most-voted answers, 4 include grep even though the
    post explicitly says "using sed or awk only".
  2. Even though the post asks for "No perl please", the previous point
    applies, and those answers use Perl regexes inside grep.
  3. And because this is the simplest way (as far as I know, and as was
    required) to do it in BASH.

So here is the simplest script, using GNU grep 2.28:

grep -Po 'href="\K.*?(?=")'

About the \K switch: no info is to be found in the man or info pages, so I came here for the answer....
The \K switch discards from the reported match everything matched before it (including the href=" key itself).
Bear in mind the advice from the man page:
"This is highly experimental and grep -P may warn of unimplemented features."

Of course, you can modify the script to suit your tastes or needs, but I found it pretty much to the point for what was requested in the post, and also for many of us...
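
For instance, a variant that also accepts single-quoted attribute values might look like this (a sketch; it still needs GNU grep with PCRE support, and index.html is just a hypothetical input file):

grep -Po 'href=["\x27]\K[^"\x27]+' index.html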

I hope you folks find it very useful.

thanks!!!

£冰雨忧蓝° 2024-08-21 09:28:25

In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)

$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

for example:

$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

generates

//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
最笨的告白 2024-08-21 09:28:25

Expanding on kerkael's answer:

grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
# now adding some more
  |grep -v "<a href=\"#"
  |grep -v "<a href=\"../"
  |grep -v "<a href=\"http"

The first grep I added removes links to local bookmarks.

The second removes relative links to upper levels.

The third removes links that don't start with http.

Pick and choose which one of these you use as per your specific requirements.

原来分手还会想你 2024-08-21 09:28:25

Go over the file with a first pass, replacing the start of each URL (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of a line and is the only URL on that line.

The rest should be easy, here is an example:

sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"

alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
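
A usage sketch for that alias (page.html being a hypothetical local copy of the page):

curl -s "http://www.cnn.com" > page.html
lsurls page.html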

恰似旧人归 2024-08-21 09:28:25

Eschewing the awk/sed requirement:

  1. urlextract is made just for such a task (documentation).
  2. urlview is an interactive CLI solution (github repo).
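
A minimal usage sketch for urlextract, assuming it is installed (e.g. via pip install urlextract) and that its command-line entry point is given my.html as the input file:

urlextract my.html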
掩于岁月 2024-08-21 09:28:25

You can try:

curl --silent -u "<username>:<password>" http://<NAGIOS_HOST>/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
笑脸一如从前 2024-08-21 09:28:25

This is how I tried it, for a better view: create a shell file and give it the link as a parameter; it will create a temp2.txt file.

a=$1
lynx -listonly -dump "$a" > temp
awk 'FNR > 2 {print $2}' temp > temp2.txt
rm temp

Run it like this:

>sh test.sh http://link.com
长亭外,古道边 2024-08-21 09:28:25

I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.

curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'

Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.

Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.

The awk finds any line that begins with href or src and outputs it.

Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:

curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'

Once the subset is extracted, just remove the href=" or src="

sed -r 's~(href="|src=")~~g'

This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.

尛丟丟 2024-08-21 09:28:25

Lynx did the job for me,
lynx -dump -listonly -nonumbers bookmarks_10_10_23.html

but unlike the other response, it doesn't display the numbers, and you can feed the resulting URLs to another process, for example:

 lynx -dump -listonly -nonumbers bookmarks_10_10_23.html | xargs -L1 -i{} wayback_machine_downloader -e "{}"

what the options mean:

-dump  dumps  the  formatted  output  of  the  default document or those specified on the command line to
       standard output.  Unlike interactive mode, all documents are processed.  This can be used  in  the
       following way:

       lynx -dump http://www.subir.com/lynx.html
-listonly
       for -dump, show only the list of links.
-nonumbers
       disable link- and field-numbering.  This overrides -number_fields and -number_links.
自由如风 2024-08-21 09:28:25
$ curl -ks <url> | awk -F'"' -v RS='[ >]' '/^href=/ && NF>2{print $2}'

Example:

$ curl -ks https://stackoverflow.com|awk -F'"' -v RS='[ >]' '/^href=/ && NF>2{print $2}'|sort -u | head -20
#
/
/collectives
/feeds
/help
/opensearch.xml
/questions
/tags
/teams/integrations/github
/teams/integrations/jira
/teams/integrations/microsoft-teams
/teams/integrations/okta
/teams/integrations/slack
/users
/users/signup?ssrc=product_home
https://ai.stackexchange.com
https://api.stackexchange.com/
https://apple.stackexchange.com
https://askubuntu.com/
https://cdn.sstatic.net/Shared/Channels/channels.css?v=a4d77abedec3