Removing HTML tags with sed or similar

Posted on 2024-12-07 16:38:47

I am trying to fetch the contents of a table from a web page. I just need the contents, not the tags. I don't even need "tr" or "td", just the content. For example:

<td> I want only this </td>
<tr> and also this </tr>
<TABLE> only texts/numbers in between tags and not the tags. </TABLE>

I would also like to put the output, starting with the first column, into a new CSV file like this:

column1,info1,info2,info3
column2,info1,info2,info3

I tried using sed to delete the patterns <tr> and <td>, but the table I fetch also contains other tags such as <color>, <span>, etc. So what I want is to delete all the tags; in short, everything between < and >, including the brackets themselves.


牵你的手，一向走下去 2024-12-14 16:38:47

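sed 's/<[^>]\+>//g' will strip out all tags, but you may want to replace them with a space so that text from adjacent tags doesn't run together: <td>one</td><td>two</td> becomes onetwo. So you could use sed 's/<[^>]\+>/ /g' instead, which outputs "one  two" (well, actually " one  two ", with a leading and a trailing space).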
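For comparison, here is a minimal Python sketch of the same two substitutions, using the standard `re` module. The function names (`strip_tags`, `tags_to_spaces`, `tidy`) are made up for illustration, and `tidy` adds a whitespace clean-up step that the sed commands do not perform. Like the sed version, this is a plain text substitution and not real HTML parsing.

```python
import re

# Same pattern as the sed command: "<", one or more non-">" characters, ">".
TAG = re.compile(r"<[^>]+>")

def strip_tags(html: str) -> str:
    # Delete every tag, like the sed command with an empty replacement.
    return TAG.sub("", html)

def tags_to_spaces(html: str) -> str:
    # Replace every tag with a single space so neighbouring cells stay apart.
    return TAG.sub(" ", html)

def tidy(html: str) -> str:
    # Extra step (not in the sed commands): collapse whitespace runs and trim.
    return " ".join(tags_to_spaces(html).split())

sample = "<td>one</td><td>two</td>"
print(repr(strip_tags(sample)))      # 'onetwo'
print(repr(tags_to_spaces(sample)))  # ' one  two '
print(repr(tidy(sample)))            # 'one two'
```

The `repr` calls make the leading, trailing, and doubled spaces visible in the output.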

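That said, unless you need only the raw text (and it sounds like you want to transform the data after stripping the tags), a scripting language like Perl might be a more fitting tool for the job.

As mu is too short mentioned, scraping HTML this way can be a bit dicey, so using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for this kind of thing.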
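As a rough illustration of the parser-based approach, here is a sketch in Python that uses the standard library's `html.parser` instead of PHP's DOM API or Perl (a swapped-in tool, not what this answer names). The class `TableText` and the function `table_to_csv` are hypothetical names, and the sketch assumes every `<tr>`, `<td>`, and `<th>` is explicitly closed.

```python
import csv
import io
from html.parser import HTMLParser

class TableText(HTMLParser):
    """Collect the text of each table cell, one list of cells per <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text chunks of the cell being read, None outside

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> both match.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text inside nested tags such as <span> still arrives here.
        if self._cell is not None:
            self._cell.append(data)

def table_to_csv(html: str) -> str:
    parser = TableText()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()

sample = """
<TABLE>
  <tr><td>column1</td><td><span>info1</span></td><td>info2</td><td>info3</td></tr>
  <tr><td>column2</td><td>info1</td><td>info2</td><td>info3</td></tr>
</TABLE>
"""
print(table_to_csv(sample), end="")
```

Because the `csv` module does the writing, a cell that itself contains a comma is quoted automatically, which a plain sed substitution would not handle.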


八巷 2024-12-14 16:38:47

Regular expressions in the macOS Terminal behave a bit differently. The \+ operator is a GNU extension that the BSD sed shipped with macOS may not support in basic regular expressions, so this version uses * instead. I was able to do this on my Mac with the following example:

$ curl google.com | sed 's/<[^>]*>//g'
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    385      0 --:--:-- --:--:-- --:--:--   385

301 Moved
301 Moved
The document has moved
here.

$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Edit:

Just for clarification, the original document looked like this:

$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Also, the annoying curl progress meter can be suppressed with the -s option:

$ curl -s google.com | sed 's/<[^>]*>//g'

301 Moved
301 Moved
The document has moved
here.

$


放低过去 2024-12-14 16:38:47

This will delete the specified tags, together with everything inside them:

#!/usr/bin/python3
# based on: https://unix.stackexchange.com/a/606006/37153
# see also: https://gist.github.com/leonardreidy/40381da2588126928058
import sys

from bs4 import BeautifulSoup

if len(sys.argv) != 3:
    print("2 args required: HTML/XML file, tag name")
    sys.exit(1)

# Parse the file with the lxml backend (requires the lxml package).
with open(sys.argv[1]) as fp:
    soup = BeautifulSoup(fp, features="lxml")

# soup(name) is shorthand for soup.find_all(name).
for s in soup(sys.argv[2]):
    # extract() removes the element and its contents from the tree.
    s.extract()

print(soup)
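A related idea can be sketched with only the standard library: drop the text inside chosen elements (for example script and style, whose contents a plain sed substitution would leave behind) and keep all other text with the markup removed. This is not the BeautifulSoup script above. The names `TextExceptIn` and `text_without` are hypothetical, and the sketch assumes the skipped elements are not void elements such as br or img, since those have no closing tag.

```python
from html.parser import HTMLParser

class TextExceptIn(HTMLParser):
    """Keep all text except what sits inside the named elements."""

    def __init__(self, skip_tags):
        super().__init__(convert_charrefs=True)
        self.skip_tags = {t.lower() for t in skip_tags}
        self.depth = 0      # greater than 0 while inside a skipped element
        self.chunks = []    # text pieces that survive

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def text_without(html, skip_tags):
    parser = TextExceptIn(skip_tags)
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind by the removed markup.
    return " ".join("".join(parser.chunks).split())

sample = (
    "<html><head><style>td { color: red }</style></head>"
    "<body><p>keep <b>this</b></p><script>var x = 1;</script></body></html>"
)
print(text_without(sample, ["script", "style"]))  # keep this
```

With an empty skip list the function simply returns all text in the document, which matches what the question asks for.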
