Libxml Cleaner adds unwanted <p> tag to HTML fragment
I'm trying to sanitize user input to prevent XSS injection using lxml's HTML cleaner (built on libxml2). When I input a string like this:
Normal text <b>Bold text</b>
I get this instead:
<p>Normal text <b>Bold text</b></p>
I want to get rid of the <p> tag that surrounds all of my input.
Here is the function that currently does the cleaning:
from lxml.html import clean

cleaner = clean.Cleaner(
    scripts=True,
    javascript=True,
    allow_tags=None,
)

def sanitize_html(html):
    return cleaner.clean_html(html)
On an unrelated note, the above code has the line allow_tags = None, where I'm trying to remove all HTML tags. Does libxml have a whitelist feature where I allow only certain tags?
Comments (1)
All TEXT fragments/nodes must be enclosed in some kind of element. libxml will attempt to fix this as best it can, which is why your bare text comes back wrapped in a <p>. You can strip the wrapper after cleaning with a pair of regular expressions. Caching the compiled regular expressions or finding more efficient ways of doing this is left as an exercise to the viewer. Without re-reviewing libxml2, I think you can get away with a slice.
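The answer's own code is not shown above. The following is only a minimal sketch of the two approaches it mentions (a regular expression and a slice), not the original author's code. It assumes the input is the string returned by `cleaner.clean_html(html)` and that the cleaner added exactly one outer `<p>...</p>` pair; the helper names `strip_wrapping_p` and `strip_wrapping_p_slice` are hypothetical. Note that it cannot tell a wrapper added by the cleaner from a `<p>` the user typed, so it would strip either.

```python
import re

# Compiled once at import time so repeated calls reuse the pattern.
# Assumption: the cleaned string is exactly one outer <p>...</p> wrapper.
_WRAPPING_P = re.compile(r"^<p>(.*)</p>$", re.DOTALL)


def strip_wrapping_p(cleaned):
    """Return `cleaned` without a single outer <p>...</p> pair, if present."""
    match = _WRAPPING_P.match(cleaned.strip())
    return match.group(1) if match else cleaned


def strip_wrapping_p_slice(cleaned):
    """Same idea with slicing: drop '<p>' (3 chars) and '</p>' (4 chars)."""
    if cleaned.startswith("<p>") and cleaned.endswith("</p>"):
        return cleaned[3:-4]
    return cleaned


if __name__ == "__main__":
    wrapped = "<p>Normal text <b>Bold text</b></p>"
    print(strip_wrapping_p(wrapped))
    print(strip_wrapping_p_slice(wrapped))
```

In `sanitize_html`, this would be applied to the cleaner's output, for example `return strip_wrapping_p(cleaner.clean_html(html))`. Both helpers leave the string unchanged when it is not wrapped in `<p>`, which covers the case where the cleaner wraps the fragment in a different element.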