
Filtering JavaScript out of HTML

Posted 2024-07-19 07:53:44, 137 characters, 13 views, 0 comments

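I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?

Also, I'm using ASP.NET, if that helps.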


满地尘埃落定 2024-07-26 07:53:44

The only way to ensure that some HTML markup does not contain any JavaScript is to strip all unsafe HTML tags and attributes from it, in order to prevent Cross-Site Scripting (XSS).

However, there is in general no reliable way of explicitly removing all unsafe elements and attributes by name, since certain browsers may interpret ones you weren't even aware of at design time, and thus open up a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one. That is to say, allow only the HTML tags and attributes that you are sure are safe, and strip all others by default. Indeed, a single accidentally permitted tag can make your website vulnerable to XSS.

Whitelisting (good approach)

See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist. Quote from that page:

Here is an incomplete list of potentially dangerous HTML tags and attributes:
script, which can contain malicious script
applet, embed, and object, which can automatically download and execute malicious code
meta, which can contain malicious redirects
onload, onunload, and all other on* attributes, which can contain malicious script
style, link, and the style attribute, which can contain malicious script

Another helpful page (http://software.open-xchange.com/OX6/doc/Html-Whitelist/ch01.html) suggests a set of HTML tags and attributes, as well as CSS attributes, that are typically safe to allow, along with recommended practices.
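To make the whitelisting idea concrete, here is a minimal sketch in Python using only the standard library's html.parser. The names ALLOWED_TAGS, DROP_WITH_CONTENT, WhitelistSanitizer and sanitize are made up for this illustration, and the tag list is an assumption, not a vetted policy. It keeps a handful of formatting tags, drops every attribute, drops script and style together with their contents, and escapes all text. It does not balance tags, so treat it as an illustration of the approach rather than a production sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: only these tags survive, with no attributes.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li"}
VOID_TAGS = {"br"}                      # allowed tags that have no closing tag
DROP_WITH_CONTENT = {"script", "style"}  # remove the element and everything inside it


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Every attribute is discarded, so on* handlers and style never survive.
            self.out.append(f"<{tag}>")
        # Any other tag is silently dropped; its text content is still kept.

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text is re-escaped so it can never be re-read as markup.
            self.out.append(escape(data))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>'))
```

The important design choice is the default: anything the policy does not name, including tags that did not exist when the code was written, is removed without the code having to know about it.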

Blacklisting (generally bad approach)

Although many websites have used (and still use) the blacklisting approach, there is almost never any true need for it. (The security risks invariably outweigh the limits that whitelisting places on the formatting capabilities granted to the user.) You need to be very aware of its flaws.

For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Just from observing it briefly, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allowed scripts to run on your page, which is essentially the main problem with blacklisting.

Finally, I would strongly recommend that you utilise an HTML DOM library (such as the well-known HTML Agility Pack) for .NET, as opposed to RegEx to perform the cleaning/whitelisting, since it would be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regexes! A proper HTML reader/writer makes the coding of the system much easier, anyway.)
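As a small illustration of how easily a regex filter is fooled, here is a sketch in Python. The pattern and the name naive_strip are hypothetical, written only for this example: it is the kind of "remove script tags" regex people tend to reach for. The first payload nests one script tag inside another, so removing the inner tags reassembles a working outer tag. The second payload never uses a script tag at all.

```python
import re

# A typical naive blacklist: delete <script ...>...</script> blocks.
NAIVE_SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip(html_text):
    return NAIVE_SCRIPT_RE.sub("", html_text)


# Removing the inner tags splices the outer fragments back into a script tag.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
print(naive_strip(nested))    # <script>alert(1)</script>

# No script tag at all, so the regex leaves the event handler untouched.
handler = "<img src=x onerror=alert(1)>"
print(naive_strip(handler))   # <img src=x onerror=alert(1)>
```

A parser-based whitelist avoids both problems because it works on the tags and attributes the parser actually recognises, and re-emits only the ones it has approved.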

Hopefully that gives you a decent overview of what you need to design in order to fully (or at least maximally) prevent XSS, and of why it is critical that HTML sanitisation is performed with the unknown factor in mind.


岁月静好 2024-07-26 07:53:44

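As Lee Theobald pointed out, that is a very dangerous plan. By definition, you cannot produce "safe" HTML by filtering/blacklisting, because the user might put something into the HTML that you didn't think of (or that doesn't even exist in your browser version, but does in others).

The only safe way is a whitelist approach, i.e. strip everything except plain text and certain specific HTML constructs. That, by the way, is what stackoverflow.com does :-).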


彼岸花ソ最美的依靠 2024-07-26 07:53:44

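Here is how I do it, using a whitelist approach
(JavaScript and Python code)

https://github.com/dcollien/FilterHTML

I define a spec for the allowed subset of HTML, and only that subset should get through the filter.
There are also some options for purifying URL attributes, by allowing only certain schemes (such as http:, ftp:, etc.) and disallowing those that cause XSS/JavaScript problems (such as javascript:, or even data:).

Edit: This will not give you 100% safety out of the box in every situation, but used sensibly and combined with a few other tricks (such as checking that URLs are on the same domain, checking for the correct content type, etc.) it could be what you need.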
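The URL scheme check described above can be sketched roughly as follows. This is not FilterHTML's API: SAFE_SCHEMES and is_safe_url are hypothetical names for this illustration, and the list of safe schemes is an assumption. The sketch removes whitespace and control characters before looking for a scheme, because browsers ignore tabs and newlines inside a URL, so "java<TAB>script:" would otherwise slip past a plain string comparison.

```python
import re

# Hypothetical policy for illustration: schemes considered safe for href/src values.
SAFE_SCHEMES = {"http", "https", "ftp", "mailto"}

# A scheme is a letter followed by letters, digits, "+", "." or "-", then a colon.
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def is_safe_url(url):
    # Conservatively drop every space and control character before checking,
    # since browsers strip tabs, newlines and leading whitespace from URLs.
    cleaned = "".join(ch for ch in url if ord(ch) > 0x20)
    match = _SCHEME_RE.match(cleaned)
    if match is None:
        return True  # no scheme: a relative URL such as /images/cat.png
    return match.group(1).lower() in SAFE_SCHEMES


for candidate in ["https://example.com/", "/images/cat.png",
                  "javascript:alert(1)", " JaVa\tScRiPt:alert(1)",
                  "data:text/html;base64,AAAA"]:
    print(repr(candidate), is_safe_url(candidate))
```

Note that this is again a whitelist: unknown schemes are rejected by default instead of trying to enumerate every dangerous one.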


忆梦 2024-07-26 07:53:44

If you want users to see the HTML code itself rather than have it rendered, replace every '&', '<' and '>' with its HTML entity (replace '&' first, so the entities you emit are not escaped a second time). For example, '<' becomes '&lt;'.
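As a rough sketch of that escaping step in Python, the standard library's html.escape already performs these replacements in the right order. The wrapper name show_as_text is made up for this example.

```python
from html import escape


def show_as_text(user_html):
    # escape() replaces & first, then < and >; quote=True also escapes " and '
    # so the result is safe inside attribute values as well.
    return escape(user_html, quote=True)


print(show_as_text('<script>alert("x")</script>'))
# &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

The browser then displays the markup as literal text instead of interpreting it.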

If you want the HTML to work, the easiest way is to strip out all HTML and JavaScript and then add back only the HTML you want. Unfortunately, there is almost no sure way of removing all JavaScript while allowing only HTML.

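For example, you may want to allow images. However, you may not know that you can write

<img src='evilscript.js'>

and, depending on the browser, markup like this may be able to run script: older browsers executed javascript: URLs placed in src, and an event-handler attribute such as onerror runs script in any browser. It becomes very unsafe very fast. This is why many websites, like Wikipedia and this website, use a special lightweight markup language such as Markdown instead of raw HTML. This makes it much easier to allow formatting but not malicious JavaScript.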
