I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string.
However...
>>> unicode(BeautifulSoup('text < text'))
u'text < text'
That doesn't look like valid HTML to me. And with my tag stripper, it opens the way to all sorts of nastiness:
>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
<script>
</script>
script>alert("xss")<
<script>
</script>
script>
The <script></script> pairs will be removed, and what remains is not only an XSS attack, but even valid HTML as well.
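The bypass can be reproduced with a plain regular-expression stripper (a hypothetical stand-in for the whitelist filter, since the question doesn't show the actual stripping code):

```python
import re

# Hypothetical naive single-pass tag stripper standing in for the
# whitelist filter described above: it deletes every <script> and
# </script> token once, but the leftover fragments reassemble into
# a fresh script tag.
payload = '<<script></script>script>alert("xss")<<script></script>script>'
stripped = re.sub(r'<script>|</script>', '', payload)
print(stripped)  # <script>alert("xss")<script>
```

This is exactly why sanitizers must either re-run until a fixed point or escape the stripped remainder rather than delete tags in one pass.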
The obvious solution is to replace with &lt; all < characters that, after parsing, are found not to belong to a tag (and similarly for >, &, ' and "). But the Beautiful Soup documentation only mentions the parsing of entities, not the producing of them. Of course I can run a replace over all NavigableString nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
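For reference, the manual replacement over text nodes amounts to entity-escaping the five HTML-significant characters. On Python 3 the standard library covers this; the question's Python 2.6 only had cgi.escape, which handles fewer characters:

```python
from html import escape  # Python 3 stdlib; not available on Python 2.6

def escape_text_node(text):
    # escape() always replaces &, < and >; quote=True additionally
    # turns " into &quot; and ' into &#x27;.
    return escape(text, quote=True)

print(escape_text_node('text < text'))  # text &lt; text
```

Applying this to every text node covers the escaping, but as the question notes, it is easy to miss edge cases compared to a vetted sanitizer.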
Why doesn't Beautiful Soup escape < (and other magic characters) by default, and how do I make it do that?
N.B. I've also looked at lxml.html.clean. It seems to work on the basis of blacklisting, not whitelisting, so it doesn't seem very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (e.g. tabindex). Also, it gives an AssertionError on the input <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>. Not good.
Suggestions for other ways to clean HTML are also very welcome. I'm hardly the only person in the world trying to do this, yet there seems to be no standard solution.
2 Answers
I know this is 3.5 years after your original question, but you can use the formatter='html' argument to prettify(), encode(), or decode() to produce well-formed HTML.
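For example (a sketch assuming Beautiful Soup 4, where this keyword exists; the BeautifulSoup 3 releases used on Python 2.6 do not accept it):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('text < text', 'html.parser')
# The default "minimal" formatter already escapes &, < and > in text
# nodes; formatter='html' additionally substitutes named HTML entities
# for characters that have them.
print(soup.decode(formatter='html'))  # text &lt; text
```

The same formatter argument works with prettify() when you want indented output.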
The lxml.html.clean.Cleaner class does allow you to provide a tag whitelist with the allow_tags argument, and to use the precomputed attribute whitelist from feedparser with the safe_attrs_only argument. And lxml definitely handles entities properly on serialization.
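A minimal sketch of that configuration (the tag list here is illustrative, not a recommendation; note that allow_tags requires remove_unknown_tags=False, and in recent lxml releases the cleaner has moved to the separate lxml_html_clean package):

```python
from lxml.html.clean import Cleaner

cleaner = Cleaner(
    allow_tags=['p', 'a', 'em', 'strong'],  # illustrative tag whitelist
    remove_unknown_tags=False,              # required when using allow_tags
    safe_attrs_only=True,                   # feedparser-derived attribute whitelist
)
dirty = '<p onclick="evil()">hi <script>alert("xss")</script></p>'
print(cleaner.clean_html(dirty))
```

Both the script element and the onclick attribute are removed, and any remaining text is entity-escaped on serialization.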