Remove HTML tags with sed or similar

Posted 2024-12-07 16:38:47

I am trying to fetch the contents of a table from a webpage. I just need the contents, not the tags <tr></tr>. I don't even need "tr" or "td", just the content. For example:

<td> I want only this </td>
<tr> and also this </tr>
<TABLE> only texts/numbers in between tags and not the tags. </TABLE>

I would also like to put the first column output like this in a new CSV file:
column1,info1,info2,info3
column2,info1,info2,info3

I tried using sed to delete the patterns <tr> and <td>, but the table I fetch also contains other tags such as <color>, <span>, etc. So what I want is to delete all the tags; in short, everything between < and >.

Comments (3)

牵你的手,一向走下去 2024-12-14 16:38:47

sed 's/<[^>]\+>//g' will strip all tags out, but you might want to replace them with a space so that text from adjacent tags doesn't run together: <td>one</td><td>two</td> becomes onetwo. So you could do sed 's/<[^>]\+>/ /g' instead, which outputs one two (well, actually " one  two ", with an extra space wherever a tag was).
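
For comparison, the same substitution can be sketched in Python with re.sub. This only illustrates the regex above and is not a robust HTML parser; the helper name strip_tags and the sample row are made up for this example.

```python
import re

# '<', then one or more characters that are not '>', then '>':
# the same idea as the sed pattern <[^>]\+>.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(html, replacement=""):
    """Replace every tag-like span with `replacement` (hypothetical helper)."""
    return TAG_RE.sub(replacement, html)

row = "<td>one</td><td>two</td>"
print(repr(strip_tags(row)))       # tags deleted, so the words run together
print(repr(strip_tags(row, " ")))  # each tag becomes a single space

# Collapse the leftover runs of whitespace if single-spaced text is wanted.
print(" ".join(strip_tags(row, " ").split()))
```

The last line shows one way to tidy up the extra spaces that the space-replacement variant leaves behind.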

That said, unless you just need the raw text (and it sounds like you are trying to transform the data after stripping the tags), a scripting language like Perl might be a more fitting tool for the job.

As mu is too short mentioned, scraping HTML can be a bit dicey, so using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.
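
As a rough sketch of the parser-based approach, here is a version that uses Python's standard-library html.parser instead of PHP's DOM API (a swapped-in tool, not the one named above). It assumes that each <tr> should become one CSV line and each <td>/<th> one field; the class name TableText, the function table_to_csv, and the sample table are all invented for illustration.

```python
import csv
import io
from html.parser import HTMLParser

class TableText(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def _end_cell(self):
        # Store the open cell (if any) with its whitespace normalized.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case, so <TD> matches too.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <span> still lands in the cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr" and self._row is not None:
            self._end_cell()
            self.rows.append(self._row)
            self._row = None

def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()

sample = """<TABLE>
<tr><td>column1</td><td><span>info1</span></td><td>info2</td></tr>
<tr><td>column2</td><td>info1</td><td><b>info2</b></td></tr>
</TABLE>"""
print(table_to_csv(sample), end="")
```

Unlike the regex, this keeps track of which cell and row each piece of text belongs to, so the CSV layout does not depend on how the HTML happens to be broken across lines.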

八巷 2024-12-14 16:38:47

Original:

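The sed that ships with macOS handles regular expressions a bit differently from GNU sed (for example, \+ may not be supported), so I used * instead. I was able to do this on my Mac using the following example: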

$ curl google.com | sed 's/<[^>]*>//g'
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    385      0 --:--:-- --:--:-- --:--:--   385

301 Moved
301 Moved
The document has moved
here.

$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Edit:

Just for clarification's sake, the original looked like this:

$ curl googl.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Also, the annoying curl progress meter can be suppressed with the -s option:

$ curl -s google.com | sed 's/<[^>]*>//g' 

301 Moved
301 Moved
The document has moved
here.

$

放低过去 2024-12-14 16:38:47

This will delete the specified tags (each matching element is removed together with its contents):

#!/usr/bin/python3

# based on: https://unix.stackexchange.com/a/606006/37153

import sys
from bs4 import BeautifulSoup

if len(sys.argv) != 3:
    print("2 args required: HTML/XML file, tag name")
    sys.exit(1)

with open(sys.argv[1]) as fp:
    soup = BeautifulSoup(fp, features="lxml")
for s in soup(sys.argv[2]):       # every element with the given tag name; see https://gist.github.com/leonardreidy/40381da2588126928058
    s.extract()
print(soup)