How do I prevent BeautifulSoup from automatically closing things that look like tags but aren't?

Posted 2024-12-27 02:55:06

I'm using BeautifulSoup to escape all of the HTML tags (except for a set of pre-approved tags, like a) from an arbitrary set of text. However, I only want it to escape the tags if they are actual valid HTML tags. If something looks like a tag, but isn't, it ends up adding some HTML to close it off, which I don't want.

Example: If someone enters the text <integer>, my code ends up outputting <integer></integer> instead of just <integer>.

Here's the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names).

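import cgi  # Python 2 standard library; provides cgi.escape
import BeautifulSoup  # BeautifulSoup 3

def escape_invalid_tags(value):  # function name is illustrative
  soup = BeautifulSoup.BeautifulSoup(
    value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
  # Loop through all the tags. If a tag is not in VALID_TAGS, escape its characters.
  for tag in soup.findAll():
    if tag.name not in VALID_TAGS:
      tag.replaceWith(cgi.escape(str(tag)))
  return soup.renderContents()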

Thanks in advance.

忆离笙 2025-01-03 02:55:06

Figured this out using html5lib based on this answer as a starting point. Here's a version of what I ended up with that does the same thing as the BeautifulSoup code I started with above, except works properly for the <integer> case I described:

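import html5lib
from html5lib import sanitizer, serializer, treebuilders, treewalkers

def sanitize_html(value):  # function name is illustrative
  # Relies on an older html5lib release in which the sanitizer is a tokenizer.
  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer, tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(value)
  walker = treewalkers.getTreeWalker("dom")
  stream = walker(dom_tree)
  s = serializer.htmlserializer.HTMLSerializer(quote_attr_values=True)
  return s.render(stream)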

Thanks to everyone who helped.

深海不蓝 2025-01-03 02:55:06

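You're doing it wrong (tm).
BeautifulSoup isn't meant to be used this way.
Take a look at this: http://code.activestate.com/recipes/52281-strip-tags-and-javascript-from-html-page-leaving-o/
That recipe strips invalid tags, whereas it sounds like you want to keep them but escaped. It should be a pretty simple modification.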
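
For illustration, here is a rough sketch of what that modification might look like. It is not the recipe itself: it assumes Python 3 and uses the standard library's html.parser rather than the parser the recipe is built on. The names WhitelistEscaper and escape_unapproved_tags, and the sample contents of VALID_TAGS, are made up for this example. Because it works on the token stream instead of building a tree, nothing gets auto-closed.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; the question only names "a" as an example.
VALID_TAGS = {"a", "b", "i"}


class WhitelistEscaper(HTMLParser):
    """Re-emit the input, escaping every tag that is not whitelisted."""

    def __init__(self, valid_tags):
        # convert_charrefs=False keeps entities as separate events,
        # so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.valid_tags = valid_tags
        self.parts = []

    def _emit_tag(self, tag, raw):
        # Whitelisted tags pass through verbatim; everything else is escaped.
        self.parts.append(raw if tag in self.valid_tags else escape(raw, quote=False))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        self._emit_tag(tag, self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit_tag(tag, "</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append(escape("<!--%s-->" % data, quote=False))

    def handle_decl(self, decl):
        self.parts.append(escape("<!%s>" % decl, quote=False))

    def handle_pi(self, data):
        self.parts.append(escape("<?%s>" % data, quote=False))


def escape_unapproved_tags(value, valid_tags=VALID_TAGS):
    """Return value with all non-whitelisted markup escaped."""
    parser = WhitelistEscaper(valid_tags)
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)
```

With this sketch, escape_unapproved_tags("<integer>") gives &lt;integer&gt; with no closing tag added. Note that whitelisted tags keep their attributes untouched, so this is only an escaper for unapproved tag names and not a full sanitizer; something like an href or an event-handler attribute would still need separate checking.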
