Yet another greedy sed question
I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed and wget. Example of the frame source:
<td width="25%" align="center" valign="top"><a href="images/display.htm?concept_Core.jpg"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></a></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
and then do this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever else it might be. It is wrong, but I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, there are two of these "concept_frigate16.jpg" on one line, and my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why. Sed is greedy and this obviously shows up in this case.
Now the question is, how do I get rid of this corner case? That is, how do I make the match non-greedy so that it stops at the FIRST .jpg?
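The greedy behaviour described above can be reproduced in isolation. This is only a sketch: the variable line holds a shortened stand-in for the problematic line, not the real page content.

```shell
# Shortened stand-in for the problematic line (two .jpg references on one line).
line='<a href="images/display.htm?concept_frigate16.jpg" target="_top"><img href="images/concept_frigate16.jpg" target="_top">'

# \(.*jpg\) is greedy, so it extends to the LAST "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p')
printf '%s\n' "$greedy"
```

The captured text runs from the first file name through to the second one, which is the same symptom as in the output shown above.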
Comments (5)
Use perl:
You might want to consider changing:
into:
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say, given I don't know the full set of inputs. If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool, but they're not necessarily suited for everything.
Use [^"] instead of . in the regular expression.
This matches any character except a double quote, so the match cannot run past the end of the href value.
sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
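As a quick check, the command above can be run against a normal line and the problematic one. The two lines in sample are shortened stand-ins for the real page, used here only for illustration.

```shell
# Two shortened sample lines: a normal one and the problematic one.
sample='<td><a href="images/display.htm?concept_Core.jpg"><img src="t_core.gif"></a></td>
<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img href="images/concept_frigate16.jpg" target="_top">'

# [^"]* cannot cross the closing quote of the first href, so the capture
# ends at the first file name on each line.
names=$(printf '%s\n' "$sample" | sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p')
printf '%s\n' "$names"
```

With these sample lines, each one should yield just its first file name.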
GNU grep can do PCRE:
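One way this might look, assuming a GNU grep built with PCRE support (the -P option). The function name extract_with_grep is hypothetical, and this is a sketch rather than the exact command this answer had in mind.

```shell
# -o prints only the matched text, -P enables Perl-compatible regexes.
# \K drops the "htm?" prefix from the reported match, and .*? is
# non-greedy, so the match ends at the first "jpg".
extract_with_grep() {
    grep -oP 'htm\?\K.*?jpg' "$@"
}
```

It could be called as extract_with_grep concept.htm and prints one file name per match.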