Easiest way to extract the URLs from an HTML page using sed or awk only

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No Perl, please.

What is the easiest way to do this?
Answers (18)
You could also do something like this, provided you have lynx installed. The invocation changed at Lynx 2.8.8 (the newer form courtesy of @condit); see the sketch below.
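A sketch of the two invocations (my.html is a placeholder, and the -hiddenlinks flag is my best guess at what changed in 2.8.8):

    # Lynx versions < 2.8.8
    lynx -dump -listonly my.html

    # Lynx versions >= 2.8.8 (courtesy of @condit)
    lynx -dump -hiddenlinks=listonly my.html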
You asked for it:
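A minimal sed in this spirit (a sketch, assuming double-quoted href values with at most one anchor per line):

    sed -n 's/.*href="\([^"]*\)".*/\1/p' file.html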
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
A second pass helps after that if you want to look only at local pages, so no http, but relative paths. Both seds will give you each URL on a single line, but there is garbage, so some filtering may be needed afterwards; a sketch follows.
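Something along these lines (a sketch assuming double-quoted href values, not the answer's exact commands):

    # every href value, one per line
    sed -n 's/.*href="\([^"]*\)".*/\1/p' file.html

    # relative paths only: skip links that contain http
    sed -n '/href="http/!s/.*href="\([^"]*\)".*/\1/p' file.html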
With the Xidel - HTML/XML data extraction tool, this can be done directly, with optional conversion to absolute URLs:
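Presumably something like the following (xidel's --extract flag takes an XPath expression; resolve-uri is a standard XPath function):

    # extract the href attribute of every anchor
    xidel --extract '//a/@href' file.html

    # with conversion to absolute URLs
    xidel --extract '//a/resolve-uri(@href, base-uri())' file.html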
I made a few changes to Greg Bacon's solution to fix two problems with the original.
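For instance, a sketch of one such change (splitting the input first, so that multiple anchors on one line are each seen by the extracting sed; the actual modified script was not given here):

    sed -e 's/<a /\n<a /g' file.html | sed -n 's/.*href="\([^"]*\)".*/\1/p'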
An example, since you didn't provide any sample:
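A gawk sketch along those lines (gawk-specific: a multi-character record separator and IGNORECASE; file.html is a placeholder):

    gawk 'BEGIN { RS = "</a>"; IGNORECASE = 1 }
    {
        # each record now ends at a closing anchor tag; pull out the href value
        if (match($0, /href="[^"]*"/))
            print substr($0, RSTART + 6, RLENGTH - 7)
    }' file.html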
You can do it quite easily with the following regex, which is quite good at finding URLs; I took it from John Gruber's article on how to find URLs in text. That lets you find all URLs in a file f.html as follows:
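A sketch using the widely circulated version of Gruber's pattern; it uses Perl-style constructs, so GNU grep needs its PCRE mode (-P):

    grep -Po '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html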
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
This is my first post, so I try to do my best explaining why I post this answer...

The post explicitly says "using sed or awk only". This answer may seem to miss that point, because it uses a PERL regex inside grep; still, no Perl (only plain grep) is required to do it in BASH. So here comes the simplest script, using GNU grep 2.28:
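A minimal sketch consistent with the \K discussion that follows (file.html is a placeholder):

    grep -Po 'href="\K[^"]*' file.html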
About the \K switch: no info was found in the MAN and INFO pages, so I came here for the answer. The \K switch gets rid of the previously matched chars (and the key itself). Bear in mind the advice from the man pages: "This is highly experimental and grep -P may warn of unimplemented features."

Of course, you can modify the script to meet your tastes or needs, but I found it pretty straightforward for what was requested in the post, and also for many of us...

I hope you folks find it very useful. Thanks!!!
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
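For example, a sketch along these lines (assuming the URLs appear inside double-quoted attributes, so splitting on the quote character puts each value on its own line):

    tr '"' '\n' < f.html | grep -i '^https\?://'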
This generates the URLs found in f.html, one per line.
Expanding on kerkael's answer:
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes links that don't start with http.
Pick and choose which one of these you use as per your specific requirements.
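A sketch of the combined pipeline (the extracting sed is a reconstruction in the spirit of kerkael's answer; each grep corresponds to one filter above):

    sed -e 's/<a /\n<a /g' file.html |
      sed -n 's/.*href="\([^"]*\)".*/\1/p' |
      grep -v '^#' |     # 1. drop links to local bookmarks
      grep -v '^\.\.' |  # 2. drop relative links to upper levels
      grep '^http'       # 3. keep only links that start with http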
Go over the input with a first pass, replacing the start of each URL (http) with a newline (\nhttp). Then you have guaranteed that your link starts at the beginning of the line and is the only URL on the line. The rest should be easy; here is an example:

    sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"

To make it reusable as a command that takes a file argument:

    alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
Eschewing the awk/sed requirement:

urlextract is made just for such a task (documentation).

urlview is an interactive CLI solution (github repo).
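Minimal usage might look like this (a sketch, not from the original answer; both tools accept a file argument):

    urlextract file.html   # print the URLs found in the file
    urlview file.html      # browse the URLs interactively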
You can try:
That's how I tried it, for a better view: create a shell file and give a link as the parameter. It will create a temp2.txt file.
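A sketch of such a script (the script name and the sed patterns are assumptions; it assumes double-quoted href values):

    #!/bin/bash
    # usage: ./geturls.sh <url>  -- writes the extracted links to temp2.txt
    curl -s "$1" |
      sed -e 's/<a /\n<a /g' |
      sed -n 's/.*href="\([^"]*\)".*/\1/p' > temp2.txt
    cat temp2.txt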
I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.
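A sketch of the kind of formatting pass described below (the sed and awk patterns are reconstructions, assuming double-quoted attribute values; $site is a hypothetical variable holding the page URL):

    curl -s "$site" |
      sed -e 's~\(href\|src\)="~\n\1="~g' |           # start a new line at every href=" or src="
      sed -n 's~^\(\(href\|src\)="[^"]*\)".*~\1~p' |  # keep the link, dropping the closing quote and the rest of the line
      awk '/^(href|src)/'                             # output only lines that begin with href or src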
Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.

Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/); the forward slash can confuse the sed substitution when working with HTML.

The awk finds any line that begins with href or src and outputs it.
Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, but you do want all the other images. Our new code would look like:
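Perhaps something like this appended filter (a sketch; inline base64 images carry a data: URI, so we match on that):

    awk '/^src/ && !/base64/' links.txt   # links.txt: hypothetical file holding the formatted links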
Once the subset is extracted, just remove the href=" or src="
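For instance, again using the tilde delimiter (a sketch):

    sed -e 's~^href="~~' -e 's~^src="~~' links.txt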
This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.
Lynx did the job for me,
lynx -dump -listonly -nonumbers bookmarks_10_10_23.html
but unlike the other responses, this doesn't display the numbers, and you can feed the resulting URLs to another process, for example:
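For example (the curl status check is just an illustration of a downstream process):

    lynx -dump -listonly -nonumbers bookmarks_10_10_23.html |
      while read -r url; do
        # print each URL's HTTP status next to it
        curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"
      done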
What the options mean:
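    -dump        write the rendered page as plain text to standard output
    -listonly    with -dump, show only the list of links
    -nonumbers   omit the numbers that normally prefix each link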
Example: