Completely removing bad tags with html5lib.sanitizer

Posted 2024-11-07 18:43:33 · 829 characters · 14 views · 0 comments

I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs.

The problem is I want to remove bad tags completely and not just escape them (which seems like a bad idea anyway).

The workaround suggested in the patch here doesn't work as expected (it keeps inner content of a <tag>content</tag>).

Specifically, I want to do something like this:

Input:

<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum

Output:

<h1>Hello world</h1>
Lorem ipsum

Any ideas on how to achieve it? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (e.g. around src attrs). So far, html5lib seems to be the best thing for the purpose, if I could just get it to remove tags instead of escaping them.

Answer by 审判长, 2024-11-14 18:43:33

The challenge is to also strip unwanted nested tags. It isn't pretty but it's a step in the right direction:

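from lxml import etree
from lxml.html import fromstring

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

# Whitelist approach: rebuild each allowed element from its own text and
# tail only, so nested children such as <script> are left behind.
parts = []
doc = fromstring(html)
for el in doc.xpath(".//h1|.//b"):
    clean = etree.Element(el.tag)
    clean.text, clean.tail = el.text, el.tail
    # encoding="unicode" makes tostring() return str rather than bytes on Python 3
    parts.append(etree.tostring(clean, encoding="unicode"))

print(''.join(parts))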

Which outputs:

<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
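
For comparison, here is a minimal sketch of the "drop the tag together with its content" behaviour the question asks for, written with the standard library's html.parser rather than html5lib or lxml. The names TagStripper, strip_bad_tags, ALLOWED and DROP_WITH_CONTENT are hypothetical and exist only for this illustration. It assumes a small whitelist, discards all attributes, and has not been vetted as a security sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for this sketch (not taken from html5lib or lxml).
DROP_WITH_CONTENT = {"script", "style"}   # removed together with their content
ALLOWED = {"h1", "b", "p", "em", "strong"}  # kept, but without attributes


class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text outside dropped elements is kept and re-escaped.
        if self.skip_depth == 0:
            self.parts.append(escape(data, quote=False))


def strip_bad_tags(html_text):
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = (
    "<script>bad_thing();</script>"
    "<style>* { background: #000; }</style>"
    "<h1>Hello world<script>bad_thing();</script></h1>\n"
    "Lorem ipsum\n"
    "<b>Bold Text</b>"
)
print(strip_bad_tags(sample))
```

Tags that are neither allowed nor in the drop list (for example div) lose their markup but keep their text, which is the behaviour the question wanted to avoid only for script and style. An unclosed script tag causes everything after it to be dropped, so the sketch fails closed.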
