Beautiful Soup - how to fix broken tags

Posted on 2024-12-06 03:53:01

I'd like to know how to fix broken HTML tags before parsing with Beautiful Soup.

In the following script, the td> needs to be replaced with <td>.

How can I do the substitution so Beautiful Soup can see it?

from BeautifulSoup import BeautifulSoup

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

a = BeautifulSoup(s)

left = []
right = []

for tr in a.findAll('tr'):
    l, r = tr.findAll('td')
    left.extend(l.findAll(text=True))
    right.extend(r.findAll(text=True))

print left + right

Comments (2)

指尖凝香 2024-12-13 03:53:01

Edit (working):

I grabbed a complete (at least it should be complete) list of all HTML tags from w3 to match against. Try it out:

import re

fixedString = re.sub(">\s*(\!--|\!DOCTYPE|\
                           a|abbr|acronym|address|applet|area|\
                           b|base|basefont|bdo|big|blockquote|body|br|button|\
                           caption|center|cite|code|col|colgroup|\
                           dd|del|dfn|dir|div|dl|dt|\
                           em|\
                           fieldset|font|form|frame|frameset|\
                           head|h1|h2|h3|h4|h5|h6|hr|html|\
                           i|iframe|img|input|ins|\
                           kbd|\
                           label|legend|li|link|\
                           map|menu|meta|\
                           noframes|noscript|\
                           object|ol|optgroup|option|\
                           p|param|pre|\
                           q|\
                           s|samp|script|select|small|span|strike|strong|style|sub|sup|\
                           table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
                           u|ul|\
                           var)>", "><\g<1>>", s)
bs = BeautifulSoup(fixedString)

Produces:

>>> print s

<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>

>>> print fixedString

<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>

This one should match broken closing tags as well (e.g. /td> instead of </td>):

re.sub(">\s*(/?)(\!--|\!DOCTYPE|a|abbr|acronym|address|applet|area|\
                 b|base|basefont|bdo|big|blockquote|body|br|button|\
                 caption|center|cite|code|col|colgroup|\
                 dd|del|dfn|dir|div|dl|dt|\
                 em|\
                 fieldset|font|form|frame|frameset|\
                 head|h1|h2|h3|h4|h5|h6|hr|html|\
                 i|iframe|img|input|ins|\
                 kbd|\
                 label|legend|li|link|\
                 map|menu|meta|\
                 noframes|noscript|\
                 object|ol|optgroup|option|\
                 p|param|pre|\
                 q|\
                 s|samp|script|select|small|span|strike|strong|style|sub|sup|\
                 table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
                 u|ul|\
                 var)>", "><\g<1>\g<2>>", s)
甜是你 2024-12-13 03:53:01

If that's the only thing you're concerned about (td> -> <td>), try:

myString = re.sub('td>', '<td>', myString)

Do this before sending myString to BeautifulSoup. If there are other broken tags, give us some examples and we'll work on them. :)
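
One caveat (my note, not from the answer): re.sub('td>', '<td>', myString) as written also rewrites the 'td>' inside '</td>' and '<td>', mangling the tags that were already fine. A minimal sketch that only touches occurrences not already preceded by '<' or '/', assuming the only damage is the missing '<':

import re

# Negative lookbehind: skip "</td>" and "<td>", fix a bare "td>".
myString = re.sub(r'(?<![</])td>', '<td>', myString)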
