BeautifulStoneSoup - how to unescape and add closing tags

Posted on 2024-12-06 15:29:45

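I'm editing the original post here to clarify, and hopefully I have boiled it down to something more manageable. I have a string of XML that looks something like: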

<foo id="foo">
    <row>
        &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
        &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>

So, I'm doing something like:

from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup  # BeautifulSoup 3 (Python 2)
xml = BeautifulStoneSoup(someXml, selfClosingTags=['img'], convertEntities=BeautifulSoup.HTML_ENTITIES)

The result is something like:

<foo id="foo">
    <row>
        <img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764">
    </row>
    <row>
        <img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225">
    </row>
</foo>

Notice there are no closing tags on the img tags in each row. I'm not sure this is my issue, but it's possible. When I try:

images = xml.findAll('img')

it yields an empty list. Any ideas why BeautifulStoneSoup wouldn't find my images in this snippet of XML?

染墨丶若流云 2024-12-13 15:29:45

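The reason you are not finding the img tags is that BeautifulSoup treats them as part of the text of the "row" tag. Converting entities only changes the strings; it doesn't change the underlying structure of the document. The following isn't a great solution (it parses the document twice), but it worked when I tested it on your sample XML. The idea is to convert the text into bad XML, then have Beautiful Soup clean it up again.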

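from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
# Inner parse unescapes the entities; prettify() re-serializes; outer parse then sees real <img> tags.
soup = BeautifulSoup(BeautifulSoup(text, convertEntities=BeautifulSoup.HTML_ENTITIES).prettify())
print soup.findAll('img')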
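
For readers without BeautifulSoup 3 (which runs only on Python 2), the same two-pass idea can be sketched with the Python 3 standard library. This is not the code from the answer above, just an illustration of the technique under stated assumptions: the names SOME_XML, ImgCollector and promote_escaped_imgs are made up for this example, each row is assumed to contain nothing but escaped img markup, and every img attribute is assumed to have a value. The first pass parses the outer XML, which leaves the img markup as plain text inside each row. The second pass parses that text as HTML and turns each img into a real child element.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sample input from the question: the <img> tags are escaped, so they are text.
SOME_XML = """<foo id="foo">
    <row>
        &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
        &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>"""


class ImgCollector(HTMLParser):
    """Second-pass parser (hypothetical helper): records attributes of each <img>."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for <img ... /> via the default handle_startendtag.
        if tag == "img":
            self.images.append(dict(attrs))


def promote_escaped_imgs(xml_text):
    """Parse xml_text and turn escaped <img> text in each row into real elements."""
    # First pass: the XML parser unescapes &lt; and &gt; into the row's text.
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        collector = ImgCollector()
        collector.feed(row.text or "")  # second pass over the unescaped text
        collector.close()
        # Assumption: the row held only img markup, so its text can be dropped.
        row.text = None
        for attrs in collector.images:
            ET.SubElement(row, "img", attrs)
    return root


# Before the second pass, img exists only as text, so nothing is found.
root_before = ET.fromstring(SOME_XML)
print(len(root_before.findall(".//img")))  # prints 0

root = promote_escaped_imgs(SOME_XML)
images = root.findall(".//img")
print([img.get("src") for img in images])

# Serializing the tree writes each img as a self-closed element.
print(ET.tostring(root, encoding="unicode"))
```

The empty result before the second pass matches the diagnosis above: unescaping changes the text, not the tree. Once the img elements are real nodes, serializing the tree emits them in self-closed form, which covers the missing closing tags.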

About the author

掩于岁月
