How do I make Beautiful Soup output HTML entities?

Posted 2024-09-18 05:07:52


I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string.
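The parse-strip-serialize pipeline described above can be sketched with the modern bs4 package (the successor to the BeautifulSoup 3 used in the question); the whitelists here are hypothetical stand-ins, since the question does not list the real ones:

```python
from bs4 import BeautifulSoup

# Hypothetical whitelists for illustration only.
ALLOWED_TAGS = {'p', 'a', 'b', 'i', 'em', 'strong'}
ALLOWED_ATTRS = {'href', 'title'}

def sanitize(markup):
    soup = BeautifulSoup(markup, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            # Drop the tag but keep its children; a real sanitizer would
            # decompose() script/style so their text does not survive.
            tag.unwrap()
        else:
            tag.attrs = {k: v for k, v in tag.attrs.items()
                         if k in ALLOWED_ATTRS}
    return str(soup)

print(sanitize('<p onclick="evil()">hi <script>alert(1)</script></p>'))
```

Note that bs4's default serializer escapes `<`, `>` and `&` in text nodes, which sidesteps the exact problem the question goes on to describe for BeautifulSoup 3.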

However...

>>> unicode(BeautifulSoup('text < text'))
u'text < text'

That doesn't look like valid HTML to me. And with my tag stripper, it opens the way to all sorts of nastiness:

>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
<script>
</script>
script>alert("xss")<
<script>
</script>
script>

The <script></script> pairs will be removed, and what remains is not only an XSS attack, but even valid HTML as well.

The obvious solution is to replace all `<` characters by `&lt;` that, after parsing, are found not to belong to a tag (and similarly for `>&'"`). But the Beautiful Soup documentation only mentions the parsing of entities, not the producing of them. Of course I can run a replace over all NavigableString nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
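For reference, the escaping step itself is a one-liner with the standard library: Python 3's `html.escape` (on Python 2.6, `cgi.escape` filled this role) replaces the HTML-significant characters with entities:

```python
import html

# Escape the HTML-significant characters in a plain-text string.
print(html.escape('text < text & "quotes"'))
# → text &lt; text &amp; &quot;quotes&quot;
```

This handles the text nodes, but as the question notes, applying it correctly across a whole parse tree is exactly the part best left to tested library code.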

Why doesn't Beautiful Soup escape < (and other magic characters) by default, and how do I make it do that?


N.B. I've also looked at lxml.html.clean. It seems to work on the basis of blacklisting, not whitelisting, so it doesn't seem very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (e.g. tabindex). Also, it gives an AssertionError on the input <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>. Not good.

Suggestions for other ways to clean HTML are also very welcome. I'm hardly the only person in the world trying to do this, yet there seems to be no standard solution.


Comments (2)

凡尘雨 2024-09-25 05:07:52


I know this is 3.5yrs after your original question, but you can use the formatter='html' argument to prettify(), encode(), or decode() to produce well-formed HTML.
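A minimal sketch of this, assuming the modern bs4 package: the `'html'` formatter substitutes named entities on output, so the stray `<` from the question comes out escaped (bs4's default `'minimal'` formatter also escapes `<`, `>` and `&`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('text < text', 'html.parser')
# formatter='html' substitutes HTML named entities when serializing.
print(soup.decode(formatter='html'))
```

The same `formatter` argument works with `prettify()` and `encode()`.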

聆听风音 2024-09-25 05:07:52


The lxml.html.clean.Cleaner class does allow you to provide a tag whitelist with the allow_tags argument and to use the precomputed attribute whitelist from feedparser with the safe_attrs_only argument. And lxml definitely handles entities properly on serialization.
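A hedged sketch of that `Cleaner` setup (the tag whitelist here is hypothetical; note that in lxml ≥ 5.2 the cleaner lives in the separate `lxml_html_clean` package):

```python
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # lxml >= 5.2 moved the module to the lxml_html_clean package.
    from lxml_html_clean import Cleaner

cleaner = Cleaner(
    allow_tags=['p', 'a', 'em', 'strong'],  # hypothetical whitelist
    remove_unknown_tags=False,              # must be False when allow_tags is given
    safe_attrs_only=True,                   # keep only the feedparser-derived safe attributes
)

dirty = '<p onclick="evil()">hi</p><SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>'
print(cleaner.clean_html(dirty))
```

`clean_html` may wrap a multi-element fragment in a `<div>`, but the script element and the non-whitelisted `onclick` attribute are dropped.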
