BeautifulStoneSoup - how to unescape and add closing tags
I'm editing the original post to clarify, and hopefully I have boiled it down to something more manageable. I have a string of XML that looks something like this:
<foo id="foo">
<row>
&lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
</row>
<row>
&lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
</row>
</foo>
So, I'm doing something like:
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
xml = BeautifulStoneSoup(someXml, selfClosingTags=['img'], convertEntities=BeautifulSoup.HTML_ENTITIES)
The result of that is something like:
<foo id="foo">
<row>
<img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764">
</row>
<row>
<img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225">
</row>
</foo>
Notice that the img tags in each row have no closing tags. I'm not sure this is my issue, but it's possible. When I try:
images = xml.findAll('img')
it yields an empty list. Any ideas why BeautifulStoneSoup wouldn't find my images in this snippet of XML?
Comments (1)
You are not finding the img tags because BeautifulSoup treats them as part of the text of the "row" tag. Converting entities only changes the strings; it doesn't change the underlying structure of the document. The following isn't a great solution (it parses the document twice), but it worked when I tested it on your sample XML. The idea is to convert the text to bad XML, then have Beautiful Soup clean it up again.
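A minimal sketch of that two-pass idea, written with only the Python standard library rather than Beautiful Soup itself. `html.unescape` and `html.parser.HTMLParser` are stand-ins for the Beautiful Soup calls, `ImgCollector` and `find_images` are hypothetical names, and the sample assumes the img markup arrives entity-escaped inside each row:

```python
from html import unescape
from html.parser import HTMLParser

# Assumed input: the img tags are entity-escaped, so they are text, not tags.
ESCAPED_XML = """<foo id="foo">
<row>
&lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
</row>
<row>
&lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
</row>
</foo>"""


class ImgCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every img start tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def find_images(markup):
    """Parse the markup once and return a list of img attribute dicts."""
    parser = ImgCollector()
    parser.feed(markup)
    parser.close()
    return parser.images


# Pass 1: the escaped img markup is only character data inside <row>.
first_pass = find_images(ESCAPED_XML)

# Pass 2: unescape to get "bad" markup (unclosed img tags), then parse again.
second_pass = find_images(unescape(ESCAPED_XML))

print(len(first_pass), len(second_pass))
for image in second_pass:
    print(image["alt"], image["src"])
```

One caveat with this approach: unescaping the whole document also unescapes any entities that were meant to stay as text (for example a literal `&amp;`), so it is only safe when the escaped content is known to be markup.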