I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string.
However...
>>> unicode(BeautifulSoup('text < text'))
u'text < text'
That doesn't look like valid HTML to me. And with my tag stripper, it opens the way to all sorts of nastiness:
>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
<script>
</script>
script>alert("xss")<
<script>
</script>
script>
The <script></script> pairs will be removed, and what remains is not only an XSS attack, but even valid HTML as well.
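The bypass can be reproduced with a plain regular-expression stripper (a hypothetical stand-in for the whitelist filter, since the question doesn't show the actual stripping code):

```python
import re

# Hypothetical naive single-pass tag stripper standing in for the
# whitelist filter described above: it deletes every <script> and
# </script> token once, but the leftover fragments reassemble into
# a fresh script tag.
payload = '<<script></script>script>alert("xss")<<script></script>script>'
stripped = re.sub(r'<script>|</script>', '', payload)
print(stripped)  # <script>alert("xss")<script>
```

This is exactly why sanitizers must either re-run until a fixed point or escape the stripped remainder rather than delete tags in one pass.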
The obvious solution is to replace with &lt; all < characters that, after parsing, are found not to belong to a tag (and similarly for >, &, ' and "). But the Beautiful Soup documentation only mentions the parsing of entities, not the producing of them. Of course I can run a replace over all NavigableString nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
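For reference, the manual replacement over text nodes amounts to entity-escaping the five HTML-significant characters. On Python 3 the standard library covers this; the question's Python 2.6 only had cgi.escape, which handles fewer characters:

```python
from html import escape  # Python 3 stdlib; not available on Python 2.6

def escape_text_node(text):
    # escape() always replaces &, < and >; quote=True additionally
    # turns " into &quot; and ' into &#x27;.
    return escape(text, quote=True)

print(escape_text_node('text < text'))  # text &lt; text
```

Applying this to every text node covers the escaping, but as the question notes, it is easy to miss edge cases compared to a vetted sanitizer.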
Why doesn't Beautiful Soup escape < (and other magic characters) by default, and how do I make it do that?
N.B. I've also looked at lxml.html.clean. It seems to work on the basis of blacklisting, not whitelisting, so it doesn't seem very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (e.g. tabindex). Also, it gives an AssertionError on the input <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>. Not good.
Suggestions for other ways to clean HTML are also very welcome. I'm hardly the only person in the world trying to do this, yet there seems to be no standard solution.
2 Answers
I know this is 3.5 years after your original question, but you can use the formatter='html' argument to prettify(), encode(), or decode() to produce well-formed HTML.
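For example (a sketch assuming Beautiful Soup 4, where this keyword exists; the BeautifulSoup 3 releases used on Python 2.6 do not accept it):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('text < text', 'html.parser')
# The default "minimal" formatter already escapes &, < and > in text
# nodes; formatter='html' additionally substitutes named HTML entities
# for characters that have them.
print(soup.decode(formatter='html'))  # text &lt; text
```

The same formatter argument works with prettify() when you want indented output.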
The lxml.html.clean.Cleaner class does allow you to provide a tag whitelist with the allow_tags argument, and to use the precomputed attribute whitelist from feedparser with the safe_attrs_only argument. And lxml definitely handles entities properly on serialization.
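A minimal sketch of that configuration (the tag list here is illustrative, not a recommendation; note that allow_tags requires remove_unknown_tags=False, and in recent lxml releases the cleaner has moved to the separate lxml_html_clean package):

```python
from lxml.html.clean import Cleaner

cleaner = Cleaner(
    allow_tags=['p', 'a', 'em', 'strong'],  # illustrative tag whitelist
    remove_unknown_tags=False,              # required when using allow_tags
    safe_attrs_only=True,                   # feedparser-derived attribute whitelist
)
dirty = '<p onclick="evil()">hi <script>alert("xss")</script></p>'
print(cleaner.clean_html(dirty))
```

Both the script element and the onclick attribute are removed, and any remaining text is entity-escaped on serialization.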