Removing strings that contain a pattern from an HTML file using Unix commands
I have a messy HTML file that looks like this:
<div id=":0.page.0" class="page-element" style="width: 1620px;">
<div>
<img src="viewer_files/viewer_004.png" class="page-image" style="width: 800px; height: 1131px; display: none;">
<img src="viewer_files/viewer_005.png" class="page-image" style="width: 1600px;">
</div>
</div>// this repeats 100+ times with different 'src' attributes
This is actually all on one line (I have split it across multiple lines for readability). I am trying to remove all <img> tags that have display: none; set in the inline CSS. Is it possible to use sed/awk or some other Unix command to achieve this? I think it would have been easy if it were a well-indented HTML document.
Comments (6)
Sed has several commands, but most people only learn the substitute command, "s".
Another useful command, "d", deletes every line that matches its address pattern.
Be careful: it deletes the entire line.
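Since the file in the question is a single line, that warning matters: "d" would remove everything, not only the hidden tags. A minimal sketch of the behaviour, using made-up sample data (the variable names are only for illustration):

```shell
#!/bin/sh
# Two sample lines; only the first contains "display: none".
two_lines='<img src="a.png" style="display: none;">
<img src="b.png" style="width: 1600px;">'

# d deletes each whole line that matches the address pattern.
kept=$(printf '%s\n' "$two_lines" | sed '/display: *none/d')
printf '%s\n' "$kept"    # only the b.png line remains

# On one-line input the same command also deletes the visible tag.
one_line='<img src="a.png" style="display: none;"><img src="b.png" style="width: 1600px;">'
lost=$(printf '%s\n' "$one_line" | sed '/display: *none/d')
printf '[%s]\n' "$lost"  # prints [] because nothing is left
```

So "d" only fits here if the file is first split so that each tag sits on its own line.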
That would do it
A quick explanation of sed:
s stands for substitution.
/ is the delimiter.
s takes the first field as the pattern to search for and replaces each match with the second field. The last field holds the options.
g means global (replace every match on the line, not just the first one).
To replace in place: sed -i -e "..."
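Applied to the question, "s" with the "g" flag might look like the sketch below. It is not necessarily the exact command from the answer above. It assumes every hidden tag is written as <img ...>, that no attribute value contains a ">" character, and that the style reads display:none with optional spaces after the colon. The function name strip_hidden_imgs and the sample data are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove every <img ...> tag whose attributes contain
# "display:none" (with any number of spaces after the colon).
# [^>]* cannot cross a ">", so each match stays inside a single tag,
# which is what makes this work on one long line.
strip_hidden_imgs() {
    sed 's/<img[^>]*display: *none[^>]*>//g'
}

# Sample input on a single line, shaped like the markup in the question.
html='<div><img src="a.png" style="width: 800px; display: none;"><img src="b.png" style="width: 1600px;"></div>'

result=$(printf '%s\n' "$html" | strip_hidden_imgs)
printf '%s\n' "$result"
# prints: <div><img src="b.png" style="width: 1600px;"></div>
```

The same expression can be passed to sed -i -e to edit the file in place, as noted above. It remains a regex over markup, so unusual attribute quoting or a ">" inside an attribute value would break it.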
I would use either Twig or XMLStarlet for this kind of processing. They are a lot more reliable than sed/awk/grep, although since your pattern is regular and repeating, sed/awk/grep would work too.
HTML and regexes are a notoriously bad match, so you probably want something that is HTML-aware. I'd probably go for something like TagSoup, but there are no doubt other options that are more shell-friendly, or suitable for any favourite scripting language you may have.