BeautifulSoup -- Prevent tag from auto-closing
BeautifulSoup is choking on parsing the following code:
>>> soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />')
>>> soup.prettify()
'<img src="#" alt="Click Here >" />\n" border="0" />\n'
I should also note that I have no control over the input HTML. There are many different variations of the text and attributes, so I want to avoid using regular expressions.
Does anyone have a suggestion for stopping BeautifulSoup from automatically closing the img tag when it runs into the ">" symbol?
Edit 1: I have found this in the documentation. Could I control how BeautifulSoup parses the IMG tag?
Edit 2: I solved my problem. Before I called BS, I did a text replace:
text.replace('>"','&gt;"')
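As a rough sketch of the workaround in Edit 2: the replacement string appears to have lost its entity encoding in this copy, so the version below assumes the intent was to turn a ">" that sits directly before a closing quote into the entity "&gt;". The helper name escape_gt_before_quote is hypothetical and only used for illustration.

```python
def escape_gt_before_quote(text):
    # Assumption: the intended replacement target was the entity "&gt;".
    # Every '>' immediately followed by a double quote gets escaped.
    return text.replace('>"', '&gt;"')


raw = '<img src="#" alt="Click Here >" border="0" />'
fixed = escape_gt_before_quote(raw)
print(fixed)  # <img src="#" alt="Click Here &gt;" border="0" />

# Caveat: the replace is purely textual, so a real tag end followed by a
# quote character in the text is escaped as well.
side_effect = escape_gt_before_quote('<b>"quoted"</b>')
print(side_effect)  # <b&gt;"quoted"</b>
```

Because the replace cannot tell an attribute value from ordinary text, it is only safe when the input never has a tag immediately followed by a double quote.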
Comments (1)
BeautifulSoup4 has been updated to be context-aware and has since solved this issue. If you update to the latest version of BeautifulSoup4, it will ignore the > character when it is enclosed in quotes. The example shows that the alt attribute correctly contains the > character, and that the border attribute has been recognised.
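The behaviour described in this answer can be checked with a short script. This is a minimal sketch, assuming the bs4 package is installed and using Python's built-in "html.parser" backend; the choice of backend is an assumption here, since the answer does not say which parser it used.

```python
from bs4 import BeautifulSoup

markup = '<img src="#" alt="Click Here >" border="0" />'

# Assumption: the standard-library backend; lxml or html5lib could be
# passed instead if they are installed.
soup = BeautifulSoup(markup, "html.parser")
img = soup.img

# The quoted '>' stays inside the alt value, and border is parsed as its
# own attribute rather than being spilled into text after the tag.
print(img["alt"])
print(img["border"])
```

If both attributes print as expected, the text replace from the question is no longer needed.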