Beautiful Soup - how to fix broken tags

Posted on 2024-12-06 03:53:01

I'd like to know how to fix broken HTML tags before parsing with Beautiful Soup.

In the following script, the td> needs to be replaced with <td>.

How can I do the substitution so Beautiful Soup can see it?

from BeautifulSoup import BeautifulSoup

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

a = BeautifulSoup(s)

left = []
right = []

for tr in a.findAll('tr'):
    l, r = tr.findAll('td')
    left.extend(l.findAll(text=True))
    right.extend(r.findAll(text=True))

print left + right

Comments (2)

指尖凝香 2024-12-13 03:53:01

Edit (working):

I grabbed a complete (at least it should be complete) list of all HTML tags from w3 to match against. Try it out:

import re

fixedString = re.sub(">\s*(\!--|\!DOCTYPE|\
                           a|abbr|acronym|address|applet|area|\
                           b|base|basefont|bdo|big|blockquote|body|br|button|\
                           caption|center|cite|code|col|colgroup|\
                           dd|del|dfn|dir|div|dl|dt|\
                           em|\
                           fieldset|font|form|frame|frameset|\
                           head|h1|h2|h3|h4|h5|h6|hr|html|\
                           i|iframe|img|input|ins|\
                           kbd|\
                           label|legend|li|link|\
                           map|menu|meta|\
                           noframes|noscript|\
                           object|ol|optgroup|option|\
                           p|param|pre|\
                           q|\
                           s|samp|script|select|small|span|strike|strong|style|sub|sup|\
                           table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
                           u|ul|\
                           var)>", "><\g<1>>", s)
bs = BeautifulSoup(fixedString)

Produces:

>>> print s

<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>

>>> print fixedString

<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>

This one should match broken closing tags as well (e.g. /td> instead of </td>):

re.sub(">\s*(/?)(\!--|\!DOCTYPE|a|abbr|acronym|address|applet|area|\
                 b|base|basefont|bdo|big|blockquote|body|br|button|\
                 caption|center|cite|code|col|colgroup|\
                 dd|del|dfn|dir|div|dl|dt|\
                 em|\
                 fieldset|font|form|frame|frameset|\
                 head|h1|h2|h3|h4|h5|h6|hr|html|\
                 i|iframe|img|input|ins|\
                 kbd|\
                 label|legend|li|link|\
                 map|menu|meta|\
                 noframes|noscript|\
                 object|ol|optgroup|option|\
                 p|param|pre|\
                 q|\
                 s|samp|script|select|small|span|strike|strong|style|sub|sup|\
                 table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
                 u|ul|\
                 var)>", "><\g<1>\g<2>>", s)
甜是你 2024-12-13 03:53:01

If that's the only thing you're concerned about (td> -> <td>), try:

myString = re.sub('td>', '<td>', myString)

Do this before sending myString to BeautifulSoup. If there are other broken tags, give us some examples and we'll work on them. :)
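
One caveat (my note, not from the answer): re.sub('td>', '<td>', myString) as written also rewrites the 'td>' inside '</td>' and '<td>', mangling the tags that were already fine. A minimal sketch that only touches occurrences not already preceded by '<' or '/', assuming the only damage is the missing '<':

import re

# Negative lookbehind: skip "</td>" and "<td>", fix a bare "td>".
myString = re.sub(r'(?<![</])td>', '<td>', myString)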
