Strip all inline events from HTML tags
For HTML input, I want to neutralize all HTML elements that have inline js (onclick="..", onmouseout=".." etc).
I am thinking, isn't it enough to encode the following chars? =,(,)
So onclick="location.href='ggg.com'"
will become
onclick%3D"location.href%3D'ggg.com'"
What am I missing here?
Edit: I do need to accept active HTML (I can't simply escape all of it or convert it to entities).
Comments (2)
There's no simple method to accept HTML, but not scripts.
You have to parse HTML to DOM, remove all unwanted elements and attributes in DOM and generate new HTML.
It can't be done reliably with regular expressions.
Removing on* attributes is not enough. Scripts can be embedded in style, src, href and other attributes.
If you're using PHP, then use HTML Purifier.
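The parse-filter-regenerate approach described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library's html.parser, not a replacement for a maintained sanitizer such as HTML Purifier. The allowlists (ALLOWED_TAGS, ALLOWED_ATTRS, ALLOWED_SCHEMES) and the helper names are assumptions chosen for the example.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allowlists for illustration; a real policy would be reviewed carefully.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
ALLOWED_SCHEMES = {"", "http", "https", "mailto"}
VOID_TAGS = {"br"}
DROP_WITH_CONTENT = {"script", "style"}


def is_safe_url(value):
    # Browsers ignore whitespace and control characters inside the scheme,
    # so remove them before checking.
    cleaned = "".join(ch for ch in value if ch > " ")
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        return False
    return scheme in ALLOWED_SCHEMES


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops on*, style, and anything else not listed
            value = value or ""
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript: and other unlisted schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


dirty = ('<p onclick="x()">hi <a href="javascript:alert(1)" title="t">link</a>'
         '<script>alert(1)</script></p>')
print(sanitize(dirty))  # <p>hi <a title="t">link</a></p>
```

The design choice here is an allowlist rather than a blocklist: anything not explicitly permitted is dropped, so attributes such as style or unknown on* handlers never need to be enumerated.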
You probably have a couple of options. The easiest way is to convert quotes, and possibly the < and > characters, to their HTML-encoded equivalents (&quot; etc.), which will result in the HTML code being displayed literally.
Tell me which server-side language you are using and I can point you towards more language-specific information, if you like. (For example, PHP has htmlspecialchars()[1]).
EDIT: I just actually read your question. Okay, you want to allow HTML through but no JavaScript? Well, for lack of a simple solution jumping to mind, I suggest just using string replacement (regular expressions, if you can) to get rid of the handlers entirely.
There is a finite set of event handler attributes in JavaScript. Couple that with the need for quotation marks and you're probably good.
For proof of concept, in Perl, you'd probably do something like this:
So, capture the event handler name (only some of which I included), then a quoted expression using either single or double quotes, have optional whitespace on the end, and replace the entire thing with nothing (i.e., delete it).
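The Perl snippet referred to above is not present in this copy of the answer. As a stand-in, here is a rough Python sketch of the substitution just described. The handler list is a partial, illustrative assumption, and strip_inline_handlers is a hypothetical name.

```python
import re

# Partial, illustrative list of handler names; the real set is much longer.
HANDLERS = ("onclick", "onmouseover", "onmouseout", "onload", "onerror", "onfocus")

# Handler name, "=", a value quoted with ' or ", then optional trailing whitespace.
INLINE_HANDLER = re.compile(
    r"""\b(?:%s)\s*=\s*(["']).*?\1\s*""" % "|".join(HANDLERS),
    re.IGNORECASE | re.DOTALL,
)


def strip_inline_handlers(html: str) -> str:
    # Replace every match with nothing, i.e. delete it.
    return INLINE_HANDLER.sub("", html)


print(strip_inline_handlers("<img src=\"a.png\" onerror='x()' alt=\"\">"))
# <img src="a.png" alt="">
```

Note that this sketch only matches quoted values. HTML also permits unquoted attribute values (for example onclick=alert(1)), which this pattern would leave untouched, and it would also rewrite matching text that appears outside of tags.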
That won't work for something requiring more levels of quotation, though, since eventually you would come back to the original delimiters. Forgive the contrived and completely useless example:
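The original example is missing from this copy. The following Python sketch is a made-up input of the same kind, showing a non-greedy quoted-value pattern stopping at the first repeated delimiter and leaving debris behind. The pattern is a simplified, single-handler assumption.

```python
import re

# Simplified single-handler pattern: onclick, "=", then a quoted value.
ONCLICK = re.compile(r"""\bonclick\s*=\s*(["']).*?\1\s*""", re.IGNORECASE | re.DOTALL)

# Contrived input: the handler value tries to nest double quotes with backslashes.
nested = r'''<a onclick="alert('say \"hi\"')">x</a>'''

# The match ends at the first inner double quote, so the rest of the value survives.
print(ONCLICK.sub("", nested))
# <a hi\"')">x</a>
```

HTML itself has no backslash escape inside attribute values, so a browser would also end the value at that first inner quote. Input like this is arguably malformed to begin with.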
In THAT case, you might want to write a loop that parses the string first by word (i.e., looking for the event handler name), then going character by character, keeping track of the number of quoting levels as you go and keeping track of the current delimiter:
It's a little more time-consuming, but it should theoretically work no matter what, assuming the HTML is well-formed. (That's a horrible assumption, but if it's not well-formed you could just reject the input anyway!)
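The loop described above is also missing from this copy. A minimal Python sketch of the idea, under several assumptions, might look like this: handler names are anything matching on followed by letters, a backslash escapes the next character, and quotes nest by alternating delimiters. The names HANDLER_START, find_value_end and strip_handlers are hypothetical.

```python
import re

# Assumption: any attribute-like word starting with "on" followed by a quoted value.
HANDLER_START = re.compile(r"""\bon[a-z]+\s*=\s*(?=["'])""", re.IGNORECASE)


def find_value_end(text, start):
    """Return the index just past the quote that closes text[start], or -1."""
    stack = [text[start]]  # currently open delimiters; last item is the current one
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # treat a backslash-escaped character as plain content
            continue
        if ch in "\"'":
            if ch == stack[-1]:
                stack.pop()  # closes the current quoting level
                if not stack:
                    return i + 1
            else:
                stack.append(ch)  # opens a deeper quoting level
        i += 1
    return -1


def strip_handlers(html):
    out, pos = [], 0
    while True:
        m = HANDLER_START.search(html, pos)
        if m is None:
            out.append(html[pos:])
            return "".join(out)
        end = find_value_end(html, m.end())
        if end == -1:
            # Not well-formed: reject the input, as suggested above.
            raise ValueError("unterminated handler value")
        out.append(html[pos:m.start()])
        while end < len(html) and html[end].isspace():
            end += 1  # swallow trailing whitespace, like the regex version
        pos = end


print(strip_handlers(r'''<a onclick="alert('say \"hi\"')">x</a>'''))
# <a >x</a>
```

This sketch is still fragile: an apostrophe inside a quoted string, for instance, would be read as a new quoting level. That is one reason a real parser plus an allowlist is usually preferred.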
[1] https://www.php.net/manual/en/function.htmlspecialchars.php