Wget a page's title

Posted on 2025-01-06 19:01:57

Is it possible to Wget a page's title from the command line?

input:

$ wget http://bit.ly/rQyhG5 <<code>>

output:

If it’s broke, fix it right   - Keeping it Real Estate. Home

无所谓啦 2025-01-13 19:01:57

This script would give you what you need:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'

But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
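Both failure modes can be reproduced offline. The snippet below is only a sketch: the two sample pages and the variable names are made up for illustration, and the sed expression is the one from the command above.

```shell
# Hypothetical sample page: the title spans two lines.
multiline_page='<html>
<head><title>Hello
World</title></head>
<body></body>
</html>'

# Hypothetical sample page: an inline SVG in the body carries its own <title>.
svg_page='<html>
<head>
<title>Real</title>
</head>
<body>
<svg><title>Icon</title></svg>
</body>
</html>'

# Same expression as in the one-liner above.
extract='s!.*<title>\(.*\)</title>.*!\1!p'

multiline_result=$(printf '%s\n' "$multiline_page" | sed -n -e "$extract")
svg_result=$(printf '%s\n' "$svg_page" | sed -n -e "$extract")

# sed works line by line, so no single line holds both <title> and </title>.
printf 'multi-line title: [%s]\n' "$multiline_result"
# Every line with a complete <title>...</title> pair matches, body included.
printf 'title in body: [%s]\n' "$svg_result"
```

The first result is empty, and the second contains two lines (Real, then Icon) because the expression cannot tell the head from the body.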

This might be a little better:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | paste -s -d " "  \
  | sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!'

but it does not fit your case, because your page has the following opening head tag:

<head profile="http://gmpg.org/xfn/11">
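To see what that mismatch does in practice, here is a small offline sketch. The sample page is invented (it only borrows the head tag shown above), and the trailing "-" after paste is my addition, since some paste implementations want stdin named explicitly.

```shell
# Hypothetical page: <head> carries an attribute, and the body has an SVG <title>.
page='<html>
<head profile="http://gmpg.org/xfn/11">
<title>Real</title>
</head>
<body><svg><title>Icon</title></svg></body>
</html>'

# The literal <head> pattern never matches <head profile="...">, so the first
# sed passes the joined line through unchanged. The second sed then runs on
# the whole page, and its greedy leading .* picks the LAST <title> it can find.
result=$(printf '%s\n' "$page" \
  | paste -s -d " " - \
  | sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!')

printf '%s\n' "$result"
```

On this sample the pipeline prints Icon rather than Real: the head restriction is silently skipped, so the body title wins.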

Again, this might be better:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | paste -s -d " "  \
  | sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!'

but there are still ways to break it, for example a page with no head or no title.
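A quick way to check the no-title case without touching the network is sketched below. The sample page is hypothetical; the two sed expressions are copied from the command above, and "-" is added to paste to name stdin explicitly.

```shell
# Hypothetical page with neither <head> nor <title>.
page='<html>
<body>No head or title here</body>
</html>'

# Without -n, sed prints every line whether or not the substitution matched,
# so a page with nothing to extract comes out in full.
result=$(printf '%s\n' "$page" \
  | paste -s -d " " - \
  | sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!')

printf '%s\n' "$result"
```

Instead of an empty result, the whole joined page is printed, which is easy to mistake for a title in a script.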

Again, a better solution might be:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | paste -s -d " "  \
  | sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
  | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'

but I am sure we can find a way to break it. This is why a true XML parser is the right solution, but as your question is tagged shell, the above is the best I can come up with.
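For comparison, a parser-based variant can still be driven from the shell. This is only a sketch and not part of the answer above: it assumes xmllint (from libxml2) is installed, and title_with_parser is a hypothetical helper name.

```shell
# title_with_parser: hypothetical helper; assumes xmllint (libxml2) is available.
# Reads an HTML document on stdin and prints its first <title>.
title_with_parser() {
  # --html   parse the input with libxml2's lenient HTML parser
  # --xpath  evaluate an XPath expression and print the result
  # -        read the document from stdin
  # normalize-space() trims the text and collapses runs of whitespace,
  # so a title split over several lines comes out on one line.
  # Parser warnings about sloppy markup go to stderr and are discarded.
  xmllint --html --xpath 'normalize-space(//title)' - 2>/dev/null
}

# Usage (not run here, needs network access):
#   wget --quiet -O - http://bit.ly/rQyhG5 | title_with_parser
```

Because the XPath expression takes the first title element in document order, a later title inside the body (an inline SVG, say) should not shadow the one in the head.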

The paste and the two sed commands can be merged into a single sed, at the cost of readability. However, this version has the advantage of working on multi-line titles:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
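The H;${x;...} part is what replaces paste. A stripped-down sketch of just that idiom, on made-up three-line input, may make it easier to follow:

```shell
# H   appends each input line to the hold space, preceded by a newline.
# $   restricts the braced group to the last input line.
# x   swaps hold space and pattern space, so the pattern space now holds
#     the whole input as one multi-line string.
# The s command then turns every embedded newline into a space; p prints.
joined=$(printf 'a\nb\nc\n' | sed -n -e 'H;${x;s/\n/ /g;p;}')

printf '[%s]\n' "$joined"
```

This prints [ a b c]. The leading space is there because H puts a newline in front of every line it appends, including the first one.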

Update:

As explained in the comments, the last sed above uses the T command, which is a GNU extension. If you do not have a compatible version, you can use:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'

Update 2:

As the above still does not work on Mac, try:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'

and/or

cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -f script

(Note the \ before the $ to avoid variable expansion.)

It seems that :next does not like being prefixed by a $, which could be a problem in some sed versions.
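As an aside on the backslashes: quoting the here-document delimiter is another way to stop the shell from expanding $. The sketch below reuses the sed script from above unchanged apart from that. The temporary file and the sample page are my own assumptions, and I have only reasoned it through against GNU sed.

```shell
# Quoting the delimiter ('EOF') disables expansion inside the here-document,
# so $ can be written without a backslash.
script=$(mktemp)
cat << 'EOF' > "$script"
H
$x
$s!.*<head[^>]*>\(.*\)</head>.*!\1!
$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF

# Hypothetical sample page standing in for the wget output.
title=$(printf '%s\n' \
    '<html>' \
    '<head profile="http://gmpg.org/xfn/11">' \
    '<title>Real</title>' \
    '</head>' \
    '<body></body>' \
    '</html>' \
  | sed -n -f "$script")
rm -f "$script"

printf '%s\n' "$title"
```

On this sample it prints Real. To use it for real, pipe wget --quiet -O - http://bit.ly/rQyhG5 into sed -n -f with the script file instead of the printf.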


揽清风入怀 2025-01-13 19:01:57

The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards-compliant enough for lynx, this should not break.

lynx -dump example.com | sed '2q;d'