Libxml Cleaner adds unwanted <p> tag to HTML fragment

Posted on 2024-11-16 18:31:59

I'm trying to sanitize user input to prevent XSS injection using the HTML cleaner from lxml (the Python binding for libxml2). When I pass in a string like this:

Normal text <b>Bold text</b>

I get this instead:

<p>Normal text <b>Bold text</b></p>

I want to get rid of the <p> tag that surrounds all of my input.

Here is the function that currently does the cleaning:

from lxml.html import clean

cleaner = clean.Cleaner(
    scripts=True,
    javascript=True,
    allow_tags=None,
)

def sanitize_html(html):
    return cleaner.clean_html(html)

On an unrelated note, the code above sets allow_tags to None, which was my attempt to remove all HTML tags. Does lxml have a whitelist feature that lets me allow only certain tags?

盗梦空间 2024-11-23 18:31:59

All text fragments/nodes must be enclosed in some kind of element, and libxml will attempt to fix this as best it can. In your case it does so by wrapping the fragment in a <p>, which you can strip off afterwards:

import re

def sanitize_html(html):
    cleaned_html = cleaner.clean_html(html)
    # Remove the leading <p>, then the trailing </p>, added by the cleaner.
    return re.sub(r'</p>$', '', re.sub(r'^<p>', '', cleaned_html))

Caching the compiled regular expressions or finding a more efficient way of doing this is left as an exercise for the reader. Without re-reviewing libxml2, I think you can get away with a slice instead, written either way:

return cleaned_html[3:-4]     # Single slice operation
return cleaned_html[3:][:-4]  # Two chained slices, same result

Both slices drop the first three and last four characters unconditionally, so they are only safe when the output is known to start with <p> and end with </p>.
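One possible take on the "cache the compiled regular expression" exercise is sketched below. The names _P_WRAPPER and strip_p_wrapper are hypothetical and not part of lxml. The helper assumes the cleaner's output is a plain string and only removes the wrapper when exactly one outer <p>...</p> pair is present, so output that was not wrapped (or that contains several paragraphs) is returned unchanged.

```python
import re

# Compiled once at import time and reused on every call.
# DOTALL lets the captured body span multiple lines.
_P_WRAPPER = re.compile(r'^<p>(.*)</p>$', re.DOTALL)


def strip_p_wrapper(cleaned_html):
    """Remove a single outer <p>...</p> pair, if there is one.

    Hypothetical helper: the input is returned unchanged when it is not
    wrapped in <p>...</p>, or when the body contains another </p>
    (for example several sibling paragraphs), since stripping would then
    leave unbalanced tags behind.
    """
    match = _P_WRAPPER.match(cleaned_html)
    if match is None:
        return cleaned_html
    inner = match.group(1)
    if '</p>' in inner:
        return cleaned_html
    return inner
```

It would be called on the result of cleaner.clean_html(html) in place of the nested re.sub calls. Compared with the slice, it trades a little speed for not corrupting output that happens not to be wrapped.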
