Filtering JavaScript from HTML
I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?
Also, I'm using ASP.NET if that helps.
Comments (6)
The only way to ensure that some HTML markup does not contain any JavaScript is to filter out all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).
However, there is in general no reliable way of explicitly removing all unsafe elements and attributes by name, since certain browsers may interpret ones you weren't even aware of at design time, and thus open up a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one. That is to say, allow only the HTML tags that you are sure are safe, and strip all others by default. Indeed, a single accidentally permitted tag can make your website vulnerable to XSS.
Whitelisting (good approach)
See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist.
Here is another helpful page (http://software.open-xchange.com/OX6/doc/Html-Whitelist/ch01.html) that suggests a set of HTML tags and attributes, as well as CSS attributes, that are typically safe to allow, along with recommended practices.
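To make the whitelisting idea concrete, here is a minimal sketch in Python (chosen only for illustration; the same shape applies to an HTML parser in .NET). The tag and attribute lists, the `AllowlistSanitizer` class and the `sanitize` function are all hypothetical names of my own, not a vetted policy, so treat this as a starting point rather than a production filter.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only; review carefully before real use.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://")
# Elements whose text content is dropped together with the tag.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags/attributes; everything else is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything else unknown
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # drops javascript:, data:, and other schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is re-escaped so it can never be read back as markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>'))
```

The point of the design is that unknown tags and attributes fall through to "drop" by default, so a browser-specific element you never heard of is removed without you having to list it.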
Blacklisting (generally bad approach)
Although many websites have used (and still use) the blacklisting approach, there is almost never any true need for it. (The security risks invariably outweigh the limits that whitelisting places on the formatting capabilities granted to the user.) You need to be very aware of its flaws.
For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Just from observing it briefly, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allowed scripts to run on your page, which is essentially the main problem with blacklisting.
Finally, I would strongly recommend that you use an HTML DOM library for .NET (such as the well-known HTML Agility Pack), rather than regular expressions, to perform the cleaning/whitelisting, since it will be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regexes! A proper HTML reader/writer makes the system much easier to code, anyway.)
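As a small illustration of how easily a regex-based blacklist is fooled, the Python sketch below uses a hypothetical `naive_strip_scripts` function (my own example, not taken from any library). It removes `<script>` blocks with a regular expression, and two simple inputs defeat it.

```python
import re


def naive_strip_scripts(html_text):
    """Blacklist-style filter: delete <script>...</script> blocks with a regex."""
    pattern = r"<script\b[^>]*>.*?</script>"
    return re.sub(pattern, "", html_text, flags=re.IGNORECASE | re.DOTALL)


# 1. Nested fragments: removing the inner tags reassembles a working outer tag.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
print(naive_strip_scripts(nested))   # <script>alert(1)</script>

# 2. No script tag at all: an event-handler attribute passes straight through.
handler = '<img src="x" onerror="alert(1)">'
print(naive_strip_scripts(handler))  # <img src="x" onerror="alert(1)">
```

Patching the regex for these two cases just invites the next variant, which is why parsing the markup and re-emitting only known-safe nodes tends to hold up better.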
Hopefully that gives you a decent overview of what you need to design in order to fully (or at least maximally) prevent XSS, and of how critical it is that HTML sanitisation is performed with the unknown factor in mind.
As pointed out by Lee Theobald, that's a very dangerous plan. You cannot, by definition, ever produce "safe" HTML by filtering/blacklisting, since the user might put stuff into the HTML that you didn't think about (or that doesn't even exist in your browser version, but does in others).
The only safe way is a whitelisting approach, i.e. strip everything but plain text and certain specific HTML constructs. This, incidentally, is what stackoverflow.com does :-).
Here is how I do it using a whitelisting approach
(JavaScript and Python code)
https://github.com/dcollien/FilterHTML
I define a specification for a subset of allowed HTML, and that is only what should get through this filter.
There are also options to purify URL attributes, by allowing only certain schemes (like http:, ftp:, etc.) and disallowing those that would cause XSS/JavaScript problems (like javascript:, or even data:).
edit: This isn't going to give you 100% safety out of the box for all situations, but used intelligently and in conjunction with a few other tricks (like checking that URLs are on the same domain and have the correct content type, etc.) it could be what you need.
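The scheme check described above can be sketched in a few lines of Python using only the standard library. This is not FilterHTML's actual API; `ALLOWED_SCHEMES` and `is_safe_url` are hypothetical names, and the sketch deliberately rejects relative URLs, which a real filter might want to allow.

```python
from urllib.parse import urlsplit

# Hypothetical scheme allowlist; adjust to what your site actually needs.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_safe_url(url):
    """Return True only if the URL has an explicitly allowlisted scheme."""
    # Browsers ignore leading whitespace and embedded tabs/newlines in URLs,
    # so remove them before inspecting the scheme.
    cleaned = "".join(ch for ch in url.strip() if ch not in "\t\r\n")
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        return False  # unparseable URLs are rejected
    # Relative URLs have an empty scheme and are rejected in this sketch.
    return scheme in ALLOWED_SCHEMES


for candidate in ("https://example.com/", "javascript:alert(1)", "data:text/html;base64,AAAA"):
    print(candidate, is_safe_url(candidate))
```

Checking against a list of permitted schemes, rather than a list of forbidden ones, keeps the filter consistent with the whitelisting approach used for tags.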
If you want the HTML to be changed so that users see the HTML code itself, do a string replace of all '<', '>', '&' and ';' with their HTML entities. For example, '<' becomes '&lt;'.
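For the "show the markup as text" case, most platforms ship an escaping helper, so you rarely need hand-written string replacement. As a sketch, here is Python's standard `html.escape`; the `show_as_code` wrapper name is just for illustration.

```python
from html import escape


def show_as_code(user_html):
    """Escape markup so the browser displays it as text instead of rendering it."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return escape(user_html, quote=True)


print(show_as_code('<b onclick="x()">Tom & Jerry</b>'))
# &lt;b onclick=&quot;x()&quot;&gt;Tom &amp; Jerry&lt;/b&gt;
```

Note that the helper escapes '&' before the other characters, which avoids double-escaping the entities it has just produced.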
If you want the HTML to work, the easiest way is to remove all HTML and JavaScript and then put back only the HTML. Unfortunately, there is almost no sure way of removing all JavaScript and allowing only HTML.
For example, you may want to allow images. However, you may not know that an image tag can itself be written so that it runs a script. It becomes very unsafe very fast. This is why most websites, like Wikipedia and this website, use a special markup language such as Markdown. This makes it much easier to allow formatting but not malicious JavaScript.
You may want to check how some browser-based WYSIWYG editors, such as TinyMCE, handle this. They usually remove JS and seem to do a reasonable job of it.
The simplest thing to do would be to strip out tags with a regex. The trouble is that you can do plenty of nasty things without script tags (e.g. embed dodgy images, or link to other sites that have nasty JavaScript). Disabling HTML completely, by converting the less-than/greater-than characters into their HTML entity forms (e.g. &lt;), could also be an option.
If you want a more powerful solution, in the past I have used AntiSamy to sanitize incoming text so that it's safe for viewing.