How do I prevent BeautifulSoup from automatically closing things that look like tags but actually aren't?
I'm using BeautifulSoup to escape all of the HTML tags (except for a set of pre-approved tags, like a) from an arbitrary set of text. However, I only want it to escape the tags if they are actual valid HTML tags. If something looks like a tag, but isn't, it ends up adding some HTML to close it off, which I don't want.
Example: If someone enters the text <integer>, my code ends up spitting out <integer></integer> instead of just <integer>.
Here's the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names).
    import cgi
    import BeautifulSoup  # BeautifulSoup 3 (Python 2)

    def escape_invalid_tags(value):  # wrapper name is illustrative
        soup = BeautifulSoup.BeautifulSoup(
            value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        # Loop through all the tags. If a tag is not in VALID_TAGS, escape its characters.
        for tag in soup.findAll():
            if tag.name not in VALID_TAGS:
                tag.replaceWith(cgi.escape(str(tag)))
        return soup.renderContents()
Thanks in advance.
Comments (2)
Figured this out using html5lib based on this answer as a starting point. Here's a version of what I ended up with that does the same thing as the BeautifulSoup code I started with above, except that it works properly for the <integer> case I described:
Thanks to everyone who helped.
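The html5lib code from the answer above is not shown here. As a rough sketch of the same idea with a different tool, the version below uses the standard library's `html.parser.HTMLParser` (Python 3) instead of html5lib. It works at the token level and never builds a tree, so nothing gets auto-closed. The class name `TagEscaper`, the function `escape_invalid_tags`, and the contents of `VALID_TAGS` are all assumptions made for illustration, not the poster's code.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; substitute your own approved tag names.
VALID_TAGS = {"a", "b", "i"}


class TagEscaper(HTMLParser):
    """Re-emit the input token by token, escaping tags that are not whitelisted."""

    def __init__(self, valid_tags):
        # convert_charrefs=False so entities arrive through handle_entityref /
        # handle_charref and can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.valid_tags = valid_tags
        self.parts = []

    def _emit_tag(self, tag, raw):
        # Keep whitelisted tags verbatim; escape everything else as text.
        self.parts.append(raw if tag in self.valid_tags else escape(raw))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit_tag(tag, "</%s>" % tag)

    def handle_data(self, data):
        # Plain text: escape stray '<', '>' and '&' characters.
        self.parts.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        # Comments are treated like non-whitelisted markup and escaped.
        self.parts.append(escape("<!--%s-->" % data))


def escape_invalid_tags(value, valid_tags=VALID_TAGS):
    parser = TagEscaper(valid_tags)
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)


print(escape_invalid_tags("<integer>"))  # -> &lt;integer&gt;
```

Note that whitelisted tags are passed through with their attributes untouched, as in the original BeautifulSoup loop, so attribute filtering would have to be added separately if the input is untrusted. Declarations and processing instructions are simply dropped in this sketch.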
You are doing it wrong (tm).
BeautifulSoup is not meant to be used like that.
Take a look at this instead: http://code.activestate.com/recipes/52281-strip-tags-and-javascript-from-html-page-leaving-o/
This recipe removes invalid tags, whereas it sounds like you want to keep them in but escaped. That should be a pretty easy modification.