How do I stop BeautifulSoup from automatically closing things that look like tags but aren't?

Posted on 2024-12-27 02:55:06

I'm using BeautifulSoup to escape all of the HTML tags in arbitrary text, except for a set of pre-approved tags such as a. However, I only want it to escape tags that are actual, valid HTML tags. If something looks like a tag but isn't one, BeautifulSoup adds HTML to close it off, which I don't want.

Example: if someone enters the text <integer>, my code ends up producing <integer></integer> instead of just <integer>.

Here's the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names).

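import cgi
import BeautifulSoup  # BeautifulSoup 3 (Python 2), as used in the original post

# escape_invalid_tags is an illustrative wrapper name; the body is the original snippet.
def escape_invalid_tags(value):
  soup = BeautifulSoup.BeautifulSoup(
    value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
  # Loop through all the tags. If a tag is not in VALID_TAGS, escape its markup.
  for tag in soup.findAll():
    if tag.name not in VALID_TAGS:
      tag.replaceWith(cgi.escape(str(tag)))
  return soup.renderContents()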

Thanks in advance.
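
The extra closing tag appears because a tree-building parser turns <integer> into an element node, and serializing an element writes both its opening and closing tags. As a rough illustration (not part of the original code), the sketch below uses Python's standard-library html.parser, which only reports events and builds no tree, to show that the input itself contains a start tag and no end tag. The names EventLogger and log_events are made up for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the raw parse events without building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(text):
    """Return the list of (kind, value) events produced for text."""
    logger = EventLogger()
    logger.feed(text)
    logger.close()
    return logger.events


events = log_events("Enter an <integer> here")
# A start event is reported for "integer", but no end event:
# the closing tag only exists once a tree is serialized.
print(events)
```

Because no end-tag event is ever fired, any </integer> in the output has to come from the tree builder and serializer, not from the user's text.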

Comments (2)

忆离笙 2025-01-03 02:55:06

I figured this out using html5lib, with this answer as a starting point. Here's a version of what I ended up with; it does the same thing as the BeautifulSoup code I started with above, except that it works properly for the <integer> case I described:

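import html5lib
from html5lib import sanitizer, serializer, treebuilders, treewalkers

# sanitize_fragment is an illustrative wrapper name; the body is the original snippet.
def sanitize_fragment(value):
  # Tokenize with the sanitizer so that disallowed tags are escaped rather than parsed.
  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer, tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(value)
  walker = treewalkers.getTreeWalker("dom")
  stream = walker(dom_tree)
  s = serializer.htmlserializer.HTMLSerializer(quote_attr_values=True)
  return s.render(stream)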

Thanks to everyone who helped.

深海不蓝 2025-01-03 02:55:06

You are doing it wrong (tm).
BeautifulSoup is not meant to be used like that.
Take a look at this instead: http://code.activestate.com/recipes/52281-strip-tags-and-javascript-from-html-page-leaving-o/
This recipe removes invalid tags, whereas it sounds like you want to keep them in, but escaped. That should be a pretty easy modification.
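
For illustration, here is a minimal sketch of the "keep but escape" idea. It is not the linked recipe: it swaps in Python 3's standard-library html.parser, and the names TagEscaper, escape_unapproved_tags, and the contents of VALID_TAGS are assumptions made for this example. Because the parser is event-based and builds no tree, it never invents a closing tag. Note that approved tags are passed through with their attributes untouched, and doctype declarations and processing instructions are silently dropped, so this is not a complete sanitizer.

```python
from html import escape
from html.parser import HTMLParser

VALID_TAGS = {"a", "b", "i"}  # assumed whitelist, for illustration only


class TagEscaper(HTMLParser):
    """Re-emit approved tags as written and escape all other markup."""

    def __init__(self):
        # convert_charrefs=False so entities reach the charref/entityref
        # handlers unchanged instead of being decoded into text.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit_tag(self, tag, raw):
        # Approved tags pass through; everything else becomes literal text.
        self.parts.append(raw if tag in VALID_TAGS else escape(raw))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Overridden so that <br/> is not expanded into a start/end pair.
        self._emit_tag(tag, self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit_tag(tag, "</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append(escape("<!--%s-->" % data))


def escape_unapproved_tags(value):
    """Return value with every tag outside VALID_TAGS escaped."""
    parser = TagEscaper()
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)


print(escape_unapproved_tags("Type <integer> here"))
# prints: Type &lt;integer&gt; here
```

The <integer> case from the question comes out as escaped text with no closing tag added, while a tag such as <a href="..."> is left as written.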
