Easiest way to extract URLs from an HTML page using only sed or awk

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.

What is the easiest way to do this?

泪眸﹌ 2024-08-21 09:28:25

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html
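
If you only want the bare URLs from that listing, a small awk filter over lynx's numbered output is enough (a sketch; my.html is whatever file you are processing):

lynx -dump -listonly my.html | awk '$2 ~ /^https?:/ {print $2}'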
俏︾媚 2024-08-21 09:28:25

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

完美的未来在梦里 2024-08-21 09:28:25
grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. 第一个 grep 查找包含 url 的行。您可以添加更多元素
    之后如果你只想查看本地页面,那么就没有http,但是
    相对路径。
  2. 第一个 sed 将在每个 a href url 标签前面添加一个换行符,并带有 \n
  3. 第二个 sed 将在该行中的第二个 " 后面通过将其替换为 /a 来缩短每个 url带有换行符的 标签
    两个 sed 都会在一行上为您提供每个 url,但是存在垃圾,因此
  4. 第二个 grep href 会清理混乱
  5. 排序和 uniq 将为您提供 sourcepage.html 中存在的每个现有 url 的一个实例
grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. The first grep looks for lines containing URLs. You can add more patterns
    after it if you only want local pages (no http, just relative paths).
  2. The first sed adds a newline in front of each <a href URL tag.
  3. The second sed shortens each URL by replacing everything after the 2nd " in the line with a closing </a> tag followed by a newline.
    Both seds give you each URL on its own line, but with some garbage left over, so
  4. the second grep href cleans up the mess, and
  5. sort and uniq give you one instance of each URL present in sourcepage.html.
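
For comparison, a compact sed-only variant (just a sketch: it assumes each href value is wrapped in double quotes and prints at most one URL per input line):

sed -n 's/.*<a [^>]*href="\([^"]*\)".*/\1/p' sourcepage.html | sort -u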
oО清风挽发oО 2024-08-21 09:28:25

With the Xidel - HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
硪扪都還晓 2024-08-21 09:28:25

I made a few changes to Greg Bacon's solution:

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. We now match cases where href is not the anchor's first attribute
  2. We cover the possibility of several anchors on the same line
短暂陪伴 2024-08-21 09:28:25

An example, since you didn't provide any sample input:

awk 'BEGIN{
RS="</a>"      # one record per anchor element (a multi-character RS is a gawk extension)
IGNORECASE=1   # case-insensitive matching (also gawk-specific)
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip everything up to and including href=" (\042 is the double quote)
      gsub(/\042.*/,"",$o)        # strip from the closing quote to the end
      print $(o)
    }
  }
}' index.html
得不到的就毁灭 2024-08-21 09:28:25

You can do it quite easily with the following regex, which is quite good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows (the pattern uses Perl-style constructs such as (?: ... ) and \w, so GNU grep needs -P rather than -E):

cat f.html | grep -o \
    -P '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
刘备忘录 2024-08-21 09:28:25

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!

谁的新欢旧爱 2024-08-21 09:28:25

This is my first post, so I will try to do my best to explain why I am posting this answer...

  1. Of the first 7 most-voted answers, 4 include grep even though the
    post explicitly says "using sed or awk only".
  2. Even though the post asks for "No perl please", the previous point
    applies, and those answers use Perl regexes inside grep.
  3. And because this is the simplest way (as far as I know, and as was
    required) to do it in BASH.

So here is the simplest script, using GNU grep 2.28:

grep -Po 'href="\K.*?(?=")'

About the \K switch: no info is to be found in the man or info pages, so I came here for the answer....
The \K switch discards from the reported match everything matched before it (including the href=" key itself).
Bear in mind the advice from the man page:
"This is highly experimental and grep -P may warn of unimplemented features."

Of course, you can modify the script to suit your tastes or needs, but I found it pretty much to the point for what was requested in the post, and also for many of us...
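
For instance, a variant that also accepts single-quoted attribute values might look like this (a sketch; it still needs GNU grep with PCRE support, and index.html is just a hypothetical input file):

grep -Po 'href=["\x27]\K[^"\x27]+' index.html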

I hope you folks find it very useful.

thanks!!!

£冰雨忧蓝° 2024-08-21 09:28:25

In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)

$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

for example:

$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

generates

//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
最笨的告白 2024-08-21 09:28:25

Expanding on kerkael's answer:

grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
# now adding some more
  |grep -v "<a href=\"#"
  |grep -v "<a href=\"../"
  |grep -v "<a href=\"http"

The first grep I added removes links to local bookmarks.

The second removes relative links to upper levels.

The third removes links that don't start with http.

Pick and choose which one of these you use as per your specific requirements.

原来分手还会想你 2024-08-21 09:28:25

Go over the file with a first pass, replacing the start of each URL (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of a line and is the only URL on that line.

The rest should be easy, here is an example:

sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"

alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
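
A usage sketch for that alias (page.html being a hypothetical local copy of the page):

curl -s "http://www.cnn.com" > page.html
lsurls page.html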

恰似旧人归 2024-08-21 09:28:25

Eschewing the awk/sed requirement:

  1. urlextract is made just for such a task (documentation).
  2. urlview is an interactive CLI solution (github repo).
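
A minimal usage sketch for urlextract, assuming it is installed (e.g. via pip install urlextract) and that its command-line entry point is given my.html as the input file:

urlextract my.html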
掩于岁月 2024-08-21 09:28:25

You can try:

curl --silent -u "<username>:<password>" http://<NAGIOS_HOST>/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
笑脸一如从前 2024-08-21 09:28:25

This is how I tried it, for a better view: create a shell file and give it the link as a parameter; it will create a temp2.txt file.

a=$1
lynx -listonly -dump "$a" > temp
awk 'FNR > 2 {print $2}' temp > temp2.txt
rm temp

Run it like this:

>sh test.sh http://link.com
长亭外,古道边 2024-08-21 09:28:25

I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.

curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'

Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.

Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.

The awk finds any line that begins with href or src and outputs it.

Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:

curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'

Once the subset is extracted, just remove the href=" or src="

sed -r 's~(href="|src=")~~g'

This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.

尛丟丟 2024-08-21 09:28:25

Lynx did the job for me,
lynx -dump -listonly -nonumbers bookmarks_10_10_23.html

but unlike the other response, it doesn't display the numbers, and you can feed the resulting URLs to another process, for example:

 lynx -dump -listonly -nonumbers bookmarks_10_10_23.html | xargs -L1 -i{} wayback_machine_downloader -e "{}"

what the options mean:

-dump  dumps  the  formatted  output  of  the  default document or those specified on the command line to
       standard output.  Unlike interactive mode, all documents are processed.  This can be used  in  the
       following way:

       lynx -dump http://www.subir.com/lynx.html
-listonly
       for -dump, show only the list of links.
-nonumbers
       disable link- and field-numbering.  This overrides -number_fields and -number_links.
自由如风 2024-08-21 09:28:25
$ curl -ks <url> | awk -F'"' -v RS='[ >]' '/^href=/ && NF>2{print $2}'

Example:

$ curl -ks https://stackoverflow.com|awk -F'"' -v RS='[ >]' '/^href=/ && NF>2{print $2}'|sort -u | head -20
#
/
/collectives
/feeds
/help
/opensearch.xml
/questions
/tags
/teams/integrations/github
/teams/integrations/jira
/teams/integrations/microsoft-teams
/teams/integrations/okta
/teams/integrations/slack
/users
/users/signup?ssrc=product_home
https://ai.stackexchange.com
https://api.stackexchange.com/
https://apple.stackexchange.com
https://askubuntu.com/
https://cdn.sstatic.net/Shared/Channels/channels.css?v=a4d77abedec3