Yet another greedy sed question

Posted 2024-10-02 05:45:50

I'm doing an automated download of a number of images using an html frame source. So far, so good: sed and wget. Example of the frame source:

<td width="25%" align="center" valign="top"><a href="images/display.htm?concept_Core.jpg"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></a></td>

So I do this:

sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm

to get the part which looks like this:

concept_Core.jpg

and then do this:

wget --base=/some/url/concept_Core.jpg

But there is one nasty line. That line is obviously a bug in the site, or whatever else it may be; it is wrong, but I can't change it. ;)

<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>

That is, there are two of these "concept_frigate16.jpg" in one line. And my script gives me

concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg

You understand why: sed is greedy, and it shows in this case.

Now the question is, how do I get rid of this corner case? That is, how do I make the match non-greedy so that it stops at the FIRST .jpg?

白色秋天 2024-10-09 05:45:50

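Use Perl, whose regexes support the non-greedy quantifier .*? (group with plain parentheses, escape the literal ?, and refer to the capture as $1):

perl -ne 'print if s/^.*htm\?(.*?jpg).*$/$1/' concept.htm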

明月松间行 2024-10-09 05:45:50

You might want to consider changing:

\(.*jpg\)

into:

\([^"]*jpg\)

This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.

If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.
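As a minimal sketch of that change, the snippet below runs the adjusted sed expression over two sample lines modelled on the question (attributes shortened). The helper names extract_jpg and sample_lines are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: print the .jpg name that follows "htm?" on each input line.
extract_jpg() {
    # [^"]* cannot cross the closing quote of the href, so the match
    # ends at the first .jpg even though * itself is still greedy.
    sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
}

# Two sample lines modelled on the question: a normal one and the "nasty" one.
sample_lines() {
    printf '%s\n' '<td><a href="images/display.htm?concept_Core.jpg"><img src="t_core.gif"></a></td>'
    printf '%s\n' '<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img href="images/concept_frigate16.jpg" target="_top">'
}

sample_lines | extract_jpg
# prints:
# concept_Core.jpg
# concept_frigate16.jpg
```

This only works as long as the file name sits inside a double-quoted attribute; a link written with single quotes or no quotes would need a different character class.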

悍妇囚夫 2024-10-09 05:45:50

Use [^"] instead of . in the regex.
This matches any character except a double quote.

站稳脚跟 2024-10-09 05:45:50

sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'

栀梦 2024-10-09 05:45:50

GNU grep can do PCRE:

grep -Po '(?<=\.htm\?).*?jpg' concept.htm
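To connect this back to the download step, here is a rough sketch that turns each extracted name into a full URL. It assumes GNU grep built with PCRE support; the function name jpg_urls and the base URL in the usage note are placeholders, not anything given in the question.

```shell
#!/bin/sh
# Hypothetical helper: list a full image URL for every "htm?NAME.jpg" link in a file.
# Usage: jpg_urls BASE_URL FILE
jpg_urls() {
    base=$1
    file=$2
    # The lookbehind keeps ".htm?" out of the match, and the lazy .*?
    # stops at the first "jpg" after it.
    grep -Po '(?<=\.htm\?).*?jpg' "$file" | while IFS= read -r name; do
        printf '%s/%s\n' "$base" "$name"
    done
}
```

Something like jpg_urls http://example.invalid/images concept.htm | wget -i - would then hand the list to wget, which reads URLs from standard input when given -i -.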
