Completely remove bad tags using html5lib.sanitizer
I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs.
The problem is that I want to remove bad tags completely rather than just escape them (which seems like a bad idea anyway).
The workaround suggested in the patch here doesn't work as expected (it keeps the inner content of a <tag>content</tag>).
Specifically, I want to do something like this:
Input:
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum
Output:
<h1>Hello world</h1>
Lorem ipsum
Any ideas on how to achieve it? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (e.g. around src attrs). So far, html5lib seems to be the best tool for the purpose, if I could just get it to remove tags instead of escaping them.
The challenge is to also strip unwanted nested tags. It isn't pretty but it's a step in the right direction:
Which outputs:
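As an independent illustration of the behaviour asked for in the question (dropping a disallowed tag together with everything inside it, including nested tags), here is a minimal sketch. It deliberately swaps in Python's standard-library html.parser instead of html5lib.sanitizer, so it is not the html5lib-based approach discussed above. The names ALLOWED_TAGS, VOID_TAGS, TagDropper and strip_disallowed are hypothetical, the whitelist is only an example, and all attributes are discarded. Treat it as a demonstration of the idea and not as a security-grade sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; adjust it to your own policy.
ALLOWED_TAGS = {"h1", "h2", "h3", "p", "b", "i", "em", "strong",
                "ul", "ol", "li", "br"}
# Void elements never get a closing tag, so they must not open a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagDropper(HTMLParser):
    """Re-emit allowed tags and text; drop disallowed tags and their content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # pieces of sanitized output
        self.skipped = []  # stack of open tags inside a dropped region

    def handle_starttag(self, tag, attrs):
        # Anything inside a dropped region is dropped too, even if whitelisted.
        if self.skipped or tag not in ALLOWED_TAGS:
            if tag not in VOID_TAGS:
                self.skipped.append(tag)
            return
        self.out.append(f"<{tag}>")  # attributes are discarded in this sketch

    def handle_endtag(self, tag):
        if self.skipped:
            if tag in self.skipped:
                # Pop up to and including the matching open tag.
                while self.skipped.pop() != tag:
                    pass
            return
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipped:
            # Text arrives unescaped, so escape it again before output.
            self.out.append(escape(data, quote=False))


def strip_disallowed(html_text):
    parser = TagDropper()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        "<script>bad_thing();</script>\n"
        "<style>* { background: #000; }</style>\n"
        "<h1>Hello world</h1>\n"
        "Lorem ipsum"
    )
    print(strip_disallowed(sample).strip())
```

Run on the input from the question, this leaves only the <h1>Hello world</h1> line and the Lorem ipsum text once the leftover blank lines are stripped. A whitelisted tag nested inside a disallowed one is dropped along with its parent, which is the nested-tag case mentioned above.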