在客户端清理/重写 HTML

发布于 2024-07-09 04:36:19 字数 419 浏览 7 评论 0 原文

我需要显示通过跨域请求加载的外部资源,并确保仅显示“安全”内容。

可以使用 Prototype 的 String#stripScripts 删除脚本块。 但诸如 onclickonerror 之类的处理程序仍然存在。

是否有任何库至少可以

  • 剥离脚本块、
  • 杀死 DOM 处理程序、
  • 删除黑名单标签(例如:embedobject)。

那么有没有 JavaScript 相关的链接和示例呢?

I need to display external resources loaded via cross domain requests and make sure to only display "safe" content.

Could use Prototype's String#stripScripts to remove script blocks. But handlers such as onclick or onerror are still there.

Is there any library which can at least

  • strip script blocks,
  • kill DOM handlers,
  • remove black listed tags (eg: embed or object).

So are any JavaScript related links and examples out there?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(10

你的呼吸 2024-07-16 04:36:19

2016 年更新:现在有一个 Google 关闭 基于 Caja 消毒剂的包装。

它有一个更干净的 API,经过重写以考虑现代浏览器上可用的 API,并且与 Closure Compiler 更好地交互。


无耻插件:请参阅 caja/plugin /html-sanitizer.js 用于经过彻底审查的客户端 html 清理程序。

它已列入白名单,而不是黑名单,但白名单可根据 CajaWhitelists 进行配置


如果您想删除所有标签,请执行以下操作:

var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

var tagOrComment = new RegExp(
    '<(?:'
    // Comment body.
    + '!--(?:(?:-*[^->])*--+|-?)'
    // Special "raw text" elements whose content should be elided.
    + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
    + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
    // Regular name
    + '|/?[a-z]'
    + tagBody
    + ')>',
    'gi');
function removeTags(html) {
  var oldHtml;
  do {
    oldHtml = html;
    html = html.replace(tagOrComment, '');
  } while (html !== oldHtml);
  return html.replace(/</g, '<');
}

人们会告诉您可以创建一个元素,并分配 innerHTML ,然后获取 innerText textContent,然后转义其中的实体。 不要那样做。 它很容易受到 XSS 注入的攻击,因为即使节点从未附加到 DOM, 也会运行 onerror 处理程序。

Update 2016: There is now a Google Closure package based on the Caja sanitizer.

It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.


Shameless plug: see caja/plugin/html-sanitizer.js for a client side html sanitizer that has been thoroughly reviewed.

It is white-listed, not black-listed, but the whitelists are configurable as per CajaWhitelists


If you want to remove all tags, then do the following:

var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

var tagOrComment = new RegExp(
    '<(?:'
    // Comment body.
    + '!--(?:(?:-*[^->])*--+|-?)'
    // Special "raw text" elements whose content should be elided.
    + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
    + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
    // Regular name
    + '|/?[a-z]'
    + tagBody
    + ')>',
    'gi');
function removeTags(html) {
  var oldHtml;
  do {
    oldHtml = html;
    html = html.replace(tagOrComment, '');
  } while (html !== oldHtml);
  return html.replace(/</g, '<');
}

People will tell you that you can create an element, and assign innerHTML and then get the innerText or textContent, and then escape entities in that. Do not do that. It is vulnerable to XSS injection since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.

梦旅人picnic 2024-07-16 04:36:19

Google Caja HTML sanitizer 可以通过将其嵌入网络工作器来使其“网络就绪”一个>。 清理程序引入的任何全局变量都将包含在工作线程中,并且处理发生在它自己的线程中。

对于不支持 Web Workers 的浏览器,我们可以使用 iframe 作为消毒程序工作的单独环境。 Timothy Chien 有一个 polyfill 就是这样做的,使用 iframe 来模拟 Web Workers,所以这部分已经为我们完成了。

Caja 项目有一个 wiki 页面,介绍如何使用 Caja 作为独立的客户端清理程序

  • 检查源代码,然后通过运行 ant 进行构建,
  • 包括 html-sanitizer-minified.jshtml-css-sanitizer-minified.js< /code> 在你的页面中
  • 调用 html_sanitize(...)

工作脚本只需要遵循这些说明:(

importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

var urlTransformer, nameIdClassTransformer;

// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };

// when we receive some HTML
self.onmessage = function(event) {
    // sanitize, then send the result back
    postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};

需要更多代码才能让 simworker 库工作,但这对此并不重要讨论。)

演示:https://dl.dropbox.com/u/ 291406/html-sanitize/demo.html

The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, plus processing takes place in its own thread.

For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.

The Caja project has a wiki page on how to use Caja as a standalone client-side sanitizer:

  • Checkout the source, then build by running ant
  • Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page
  • Call html_sanitize(...)

The worker script only needs to follow those instructions:

importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

var urlTransformer, nameIdClassTransformer;

// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };

// when we receive some HTML
self.onmessage = function(event) {
    // sanitize, then send the result back
    postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};

(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)

Demo: https://dl.dropbox.com/u/291406/html-sanitize/demo.html

绳情 2024-07-16 04:36:19

永远不要相信客户。 如果您正在编写服务器应用程序,请假设客户端始终会提交不卫生的恶意数据。 这是一条经验法则,可以让您远离麻烦。 如果可以的话,我建议在服务器代码中进行所有验证和清理,您知道(在合理的程度上)不会被摆弄。 也许您可以使用服务器端 Web 应用程序作为客户端代码的代理,该代码从第 3 方获取并在将其发送到客户端本身之前进行清理?

[编辑] 抱歉,我误解了这个问题。 不过,我坚持我的建议。 如果您在将服务器发送给他们之前先对其进行清理,那么您的用户可能会更安全。

Never trust the client. If you're writing a server application, assume that the client will always submit unsanitary, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitation in server code, which you know (to a reasonable degree) won't be fiddled with. Perhaps you could use a serverside web application as a proxy for your clientside code, which fetches from the 3rd party and does sanitation before sending it to the client itself?

[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.

离笑几人歌 2024-07-16 04:36:19

现在所有主要浏览器都支持沙盒 iframe,我认为有一种更简单的方法可以保证安全。 如果更熟悉此类安全问题的人能够审查此答案,我会很高兴。

注意:此方法在 IE 9 及更早版本中肯定不起作用。 请参阅此表了解支持沙箱的浏览器版本。(注意:该表似乎说它在 Opera Mini 中不起作用,但我刚刚尝试过,它起作用了。)

这个想法是创建一个禁用 JavaScript 的隐藏 iframe,将不受信任的 HTML 粘贴到其中,然后让它解析它。 然后你可以遍历 DOM 树并复制出被认为安全的标签和属性。

此处显示的白名单只是示例。 最好将什么列入白名单取决于应用程序。 如果您需要比标签和属性白名单更复杂的策略,则可以通过此方法来满足,但不能通过此示例代码来满足。

var tagWhitelist_ = {
  'A': true,
  'B': true,
  'BODY': true,
  'BR': true,
  'DIV': true,
  'EM': true,
  'HR': true,
  'I': true,
  'IMG': true,
  'P': true,
  'SPAN': true,
  'STRONG': true
};

var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;
  
  function makeSanitizedCopy(node) {
    if (node.nodeType == Node.TEXT_NODE) {
      var newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        var subCopy = makeSanitizedCopy(node.childNodes[i]);
        newNode.appendChild(subCopy, false);
      }
    } else {
      newNode = document.createDocumentFragment();
    }
    return newNode;
  };

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
};

安全漏洞:评论者 @Explosion 指出 href 属性可以包含 JavaScript,例如 。 应该可以捕获它并在清理代码中将其删除,但上面的代码尚未更新来做到这一点。

您可以在此处尝试一下。

请注意,在此示例中我不允许使用样式属性和标签。 如果您允许它们,您可能需要解析 CSS 并确保它对于您的目的来说是安全的。

我已经在几种现代浏览器(Chrome 40、Firefox 36 Beta、IE 11、Chrome for Android)和一款旧浏览器(IE 8)上对此进行了测试,以确保它在执行任何脚本之前可以正常运行。 我很想知道是否有任何浏览器遇到问题,或者我忽略了任何边缘情况。

Now that all major browsers support sandboxed iframes, there is a much simpler way that I think can be secure. I'd love it if this answer could be reviewed by people who are more familiar with this kind of security issue.

NOTE: This method definitely will not work in IE 9 and earlier. See this table for browser versions that support sandboxing. (Note: the table seems to say it won't work in Opera Mini, but I just tried it, and it worked.)

The idea is to create a hidden iframe with JavaScript disabled, paste your untrusted HTML into it, and let it parse it. Then you can walk the DOM tree and copy out the tags and attributes that are considered safe.

The whitelists shown here are just examples. What's best to whitelist would depend on the application. If you need a more sophisticated policy than just whitelists of tags and attributes, that can be accommodated by this method, though not by this example code.

var tagWhitelist_ = {
  'A': true,
  'B': true,
  'BODY': true,
  'BR': true,
  'DIV': true,
  'EM': true,
  'HR': true,
  'I': true,
  'IMG': true,
  'P': true,
  'SPAN': true,
  'STRONG': true
};

var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;
  
  function makeSanitizedCopy(node) {
    if (node.nodeType == Node.TEXT_NODE) {
      var newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        var subCopy = makeSanitizedCopy(node.childNodes[i]);
        newNode.appendChild(subCopy, false);
      }
    } else {
      newNode = document.createDocumentFragment();
    }
    return newNode;
  };

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
};

SECURITY HOLE: Commenter @Explosion points out that an href attribute can contain JavaScript, like <a href="javascript:alert('Oops')">. It should be possible to catch that and remove it in the sanitization code, but the above code has not (yet) been updated to do that.

You can try it out here.

Note that I'm disallowing style attributes and tags in this example. If you allowed them, you'd probably want to parse the CSS and make sure it's safe for your purposes.

I've tested this on several modern browsers (Chrome 40, Firefox 36 Beta, IE 11, Chrome for Android), and on one old one (IE 8) to make sure it bailed before executing any scripts. I'd be interested to know if there are any browsers that have trouble with it, or any edge cases that I'm overlooking.

寒江雪… 2024-07-16 04:36:19

现在是 2016 年,我想我们中的许多人现在都在代码中使用 npm 模块。 不过 sanitize-html 似乎是 npm 上的领先选项还有其他

这个问题的其他答案为如何推出自己的解决方案提供了很好的建议,但这是一个足够棘手的问题,经过充分测试的社区解决方案可能是最好的答案。

在命令行上运行此命令进行安装:
<代码>
npm install --save sanitize-html
ES5


<代码>
var sanitizeHtml = require('sanitize-html');
// ...
var sanitized = sanitizeHtml(htmlInput);
ES6


<代码>
从“sanitize-html”导入 sanitizeHtml;
// ...
让 sanitized = sanitizeHtml(htmlInput);

So, it's 2016, and I think many of us are using npm modules in our code now. sanitize-html seems like the leading option on npm, though there are others.

Other answers to this question provide great input in how to roll your own, but this is a tricky enough problem that well-tested community solutions are probably the best answer.

Run this on the command line to install:

npm install --save sanitize-html

ES5:

var sanitizeHtml = require('sanitize-html');
// ...
var sanitized = sanitizeHtml(htmlInput);

ES6:

import sanitizeHtml from 'sanitize-html';
// ...
let sanitized = sanitizeHtml(htmlInput);

手长情犹 2024-07-16 04:36:19

您无法预见某些浏览器可能会遇到的每种可能的奇怪类型的格式错误标记,以逃避黑名单,因此不要列入黑名单。 除了脚本/嵌入/对象和处理程序之外,您可能还需要删除更多的结构。

相反,尝试将 HTML 解析为层次结构中的元素和属性,然后根据尽可能最小的白名单运行所有元素和属性名称。 还要根据白名单检查您允许通过的任何 URL 属性(请记住,还有比 javascript 更危险的协议:)。

如果输入是格式良好的 XHTML,则上述第一部分会容易得多。

与 HTML 清理一样,如果您可以找到任何其他方法来避免这样做,那就这样做。 有很多很多潜在的漏洞。 如果主要的网络邮件服务在这么多年后仍然发现漏洞,是什么让您认为自己可以做得更好?

You can't anticipate every possible weird type of malformed markup that some browser somewhere might trip over to escape blacklisting, so don't blacklist. There are many more structures you might need to remove than just script/embed/object and handlers.

Instead attempt to parse the HTML into elements and attributes in a hierarchy, then run all element and attribute names against an as-minimal-as-possible whitelist. Also check any URL attributes you let through against a whitelist (remember there are more dangerous protocols than just javascript:).

If the input is well-formed XHTML the first part of the above is much easier.

As always with HTML sanitisation, if you can find any other way to avoid doing it, do that instead. There are many, many potential holes. If the major webmail services are still finding exploits after this many years, what makes you think you can do better?

倒数 2024-07-16 04:36:19

[免责声明:我是作者之一]

我们为此编写了一个“仅限网络”(即“需要浏览器”)的开源库,https://github.com/jitbit/HtmlSanitizer 删除除“白名单”之外的所有标签/属性/样式

用法:

var input = HtmlSanitizer.SanitizeHtml("<script> Alert('xss!'); </scr"+"ipt>");

PS 的工作速度比“纯 JavaScript”解决方案快得多,因为它使用浏览器来解析和操作 DOM。 如果您对“纯 JS”解决方案感兴趣,请尝试 https://github.com/punkave/ sanitize-html(不附属)

[Disclaimer: I'm one of the authors]

We wrote a "web-only" (i.e. "requires a browser") open source library for this, https://github.com/jitbit/HtmlSanitizer that removes all tags/attributes/styles except the "whitelisted" ones.

Usage:

var input = HtmlSanitizer.SanitizeHtml("<script> Alert('xss!'); </scr"+"ipt>");

P.S. Works much faster than a "pure JavaScript" solution since it uses the browser to parse and manipulate DOM. If you're interested in a "pure JS" solution please try https://github.com/punkave/sanitize-html (not affiliated)

柠栀 2024-07-16 04:36:19

上面建议的 Google Caja 库太复杂,无法配置并包含在我的 Web 应用程序项目中(因此,在浏览器上运行)。 因为我们已经使用了 CKEditor 组件,所以我所采取的方法是使用它的内置 HTML 清理和白名单功能,这更容易配置。 因此,您可以在隐藏的 iframe 中加载 CKEditor 实例并执行以下操作:

CKEDITOR.instances['myCKEInstance'].dataProcessor.toHtml(myHTMLstring)

现在,当然,如果您没有在项目中使用 CKEditor,这可能有点过大了,因为组件本身大约有半兆字节(最小化),但如果您有源代码,也许您可​​以隔离执行白名单的代码(CKEDITOR.htmlParser?)并使其更短。

http://docs.ckeditor.com/#!/api

http://docs.ckeditor.com/#!/api/CKEDITOR.htmlDataProcessor

The Google Caja library suggested above was way too complex to configure and include in my project for a Web application (so, running on the browser). What I resorted to instead, since we already use the CKEditor component, is to use it's built-in HTML sanitizing and whitelisting function, which is far more easier to configure. So, you can load a CKEditor instance in a hidden iframe and do something like:

CKEDITOR.instances['myCKEInstance'].dataProcessor.toHtml(myHTMLstring)

Now, granted, if you're not using CKEditor in your project this may be a bit of an overkill, since the component itself is around half a megabyte (minimized), but if you have the sources, maybe you can isolate the code doing the whitelisting (CKEDITOR.htmlParser?) and make it much shorter.

http://docs.ckeditor.com/#!/api

http://docs.ckeditor.com/#!/api/CKEDITOR.htmlDataProcessor

荒人说梦 2024-07-16 04:36:19

我没有使用正则表达式,而是想到了一种使用本机 DOM 内容的方法。 通过这种方式,您可以将 HTML 解析为文档、获取该 HTML 并轻松获取所有特定元素以及要删除的白名单元素和属性。 它使用属性列表作为允许的简单属性字符串数组,或者可以使用正则表达式来验证它们的值并仅允许某些标签。

const sanitize = (html, tags = undefined, attributes = undefined) => {
    var attributes = attributes || [
      { attribute: "src", tags: "*", regex: /^(?:https|http|\/\/):/ },
      { attribute: "href", tags: "*", regex: /^(?!javascript:).+/ },
      { attribute: "width", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "height", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "id", tags: "*", regex: /^[a-zA-Z]+$/ },
      { attribute: "class", tags: "*", regex: /^[a-zA-Z ]+$/ },
      { attribute: "value", tags: ["INPUT", "TEXTAREA"], regex: /^.+$/ },
      { attribute: "checked", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      {
        attribute: "placeholder",
        tags: ["INPUT", "TEXTAREA"],
        regex: /^.+$/,
      },
      {
        attribute: "alt",
        tags: ["IMG", "AREA", "INPUT"],
        //"^" and "$" match beggining and end
        regex: /^[0-9a-zA-Z]+$/,
      },
      { attribute: "autofocus", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      { attribute: "for", tags: ["LABEL", "OUTPUT"], regex: /^[a-zA-Z0-9]+$/ },
    ]
    var tags = tags || [
      "I",
      "P",
      "B",
      "BODY",
      "HTML",
      "DEL",
      "INS",
      "STRONG",
      "SMALL",
      "A",
      "IMG",
      "CITE",
      "FIGCAPTION",
      "ASIDE",
      "ARTICLE",
      "SUMMARY",
      "DETAILS",
      "NAV",
      "TD",
      "TH",
      "TABLE",
      "THEAD",
      "TBODY",
      "NAV",
      "SPAN",
      "BR",
      "CODE",
      "PRE",
      "BLOCKQUOTE",
      "EM",
      "HR",
      "H1",
      "H2",
      "H3",
      "H4",
      "H5",
      "H6",
      "DIV",
      "MAIN",
      "HEADER",
      "FOOTER",
      "SELECT",
      "COL",
      "AREA",
      "ADDRESS",
      "ABBR",
      "BDI",
      "BDO",
    ]

    attributes = attributes.map((el) => {
      if (typeof el === "string") {
        return { attribute: el, tags: "*", regex: /^.+$/ }
      }
      let output = el
      if (!el.hasOwnProperty("tags")) {
        output.tags = "*"
      }
      if (!el.hasOwnProperty("regex")) {
        output.regex = /^.+$/
      }
      return output
    })
    var el = new DOMParser().parseFromString(html, "text/html")
    var elements = el.querySelectorAll("*")
    for (let i = 0; i < elements.length; i++) {
      const current = elements[i]
      let attr_list = get_attributes(current)
      for (let j = 0; j < attr_list.length; j++) {
        const attribute = attr_list[j]
        if (!attribute_matches(current, attribute)) {
          current.removeAttribute(attr_list[j])
        }
      }
      if (!tags.includes(current.tagName)) {
        current.remove()
      }
    }
    return el.documentElement.innerHTML
    function attribute_matches(element, attribute) {
      let output = attributes.filter((attr) => {
        let returnval =
          attr.attribute === attribute &&
          (attr.tags === "*" || attr.tags.includes(element.tagName)) &&
          attr.regex.test(element.getAttribute(attribute))
        return returnval
      })

      return output.length > 0
    }
    function get_attributes(element) {
      for (
        var i = 0, atts = element.attributes, n = atts.length, arr = [];
        i < n;
        i++
      ) {
        arr.push(atts[i].nodeName)
      }
      return arr
    }
  }
* {
  font-family: sans-serif;
}
textarea {
  width: 49%;
  height: 300px;
  padding: 10px;
  box-sizing: border-box;
  resize: none;
}
<h1>Sanitize HTML client side</h1>
<textarea id='input' placeholder="Unsanitized HTML">
<!-- This removes both the src and onerror attributes because src is not a valid url. -->
<img src="error" onerror="alert('XSS')">
<div id="something_harmless" onload="alert('More XSS')">
   <b>Bold text!</b> and <em>Italic text!</em>, some more text. <del>Deleted text!</del>
</div>
 <script>
    alert("This would be XSS");
  </script>
</textarea>
<textarea id='output' placeholder="Sanitized HTML will appear here" readonly></textarea>
<script>
  document.querySelector("#input").onkeyup = () => {
    document.querySelector("#output").value = sanitize(document.querySelector("#input").value);
  }
</script>

Instead of using regex,I thought of a way using native DOM stuff. This way you can parse the HTML to a doc, get that HTML and easily get all of a certain element and whitelist elements and attributes to remove. This uses a list of attributes as either an array of simple strings of attributes to allow, or it can use a regex to validate their values and only allow on certain tags.

const sanitize = (html, tags = undefined, attributes = undefined) => {
    var attributes = attributes || [
      { attribute: "src", tags: "*", regex: /^(?:https|http|\/\/):/ },
      { attribute: "href", tags: "*", regex: /^(?!javascript:).+/ },
      { attribute: "width", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "height", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "id", tags: "*", regex: /^[a-zA-Z]+$/ },
      { attribute: "class", tags: "*", regex: /^[a-zA-Z ]+$/ },
      { attribute: "value", tags: ["INPUT", "TEXTAREA"], regex: /^.+$/ },
      { attribute: "checked", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      {
        attribute: "placeholder",
        tags: ["INPUT", "TEXTAREA"],
        regex: /^.+$/,
      },
      {
        attribute: "alt",
        tags: ["IMG", "AREA", "INPUT"],
        //"^" and "$" match beggining and end
        regex: /^[0-9a-zA-Z]+$/,
      },
      { attribute: "autofocus", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      { attribute: "for", tags: ["LABEL", "OUTPUT"], regex: /^[a-zA-Z0-9]+$/ },
    ]
    var tags = tags || [
      "I",
      "P",
      "B",
      "BODY",
      "HTML",
      "DEL",
      "INS",
      "STRONG",
      "SMALL",
      "A",
      "IMG",
      "CITE",
      "FIGCAPTION",
      "ASIDE",
      "ARTICLE",
      "SUMMARY",
      "DETAILS",
      "NAV",
      "TD",
      "TH",
      "TABLE",
      "THEAD",
      "TBODY",
      "NAV",
      "SPAN",
      "BR",
      "CODE",
      "PRE",
      "BLOCKQUOTE",
      "EM",
      "HR",
      "H1",
      "H2",
      "H3",
      "H4",
      "H5",
      "H6",
      "DIV",
      "MAIN",
      "HEADER",
      "FOOTER",
      "SELECT",
      "COL",
      "AREA",
      "ADDRESS",
      "ABBR",
      "BDI",
      "BDO",
    ]

    attributes = attributes.map((el) => {
      if (typeof el === "string") {
        return { attribute: el, tags: "*", regex: /^.+$/ }
      }
      let output = el
      if (!el.hasOwnProperty("tags")) {
        output.tags = "*"
      }
      if (!el.hasOwnProperty("regex")) {
        output.regex = /^.+$/
      }
      return output
    })
    var el = new DOMParser().parseFromString(html, "text/html")
    var elements = el.querySelectorAll("*")
    for (let i = 0; i < elements.length; i++) {
      const current = elements[i]
      let attr_list = get_attributes(current)
      for (let j = 0; j < attr_list.length; j++) {
        const attribute = attr_list[j]
        if (!attribute_matches(current, attribute)) {
          current.removeAttribute(attr_list[j])
        }
      }
      if (!tags.includes(current.tagName)) {
        current.remove()
      }
    }
    return el.documentElement.innerHTML
    function attribute_matches(element, attribute) {
      let output = attributes.filter((attr) => {
        let returnval =
          attr.attribute === attribute &&
          (attr.tags === "*" || attr.tags.includes(element.tagName)) &&
          attr.regex.test(element.getAttribute(attribute))
        return returnval
      })

      return output.length > 0
    }
    function get_attributes(element) {
      for (
        var i = 0, atts = element.attributes, n = atts.length, arr = [];
        i < n;
        i++
      ) {
        arr.push(atts[i].nodeName)
      }
      return arr
    }
  }
* {
  font-family: sans-serif;
}
textarea {
  width: 49%;
  height: 300px;
  padding: 10px;
  box-sizing: border-box;
  resize: none;
}
<h1>Sanitize HTML client side</h1>
<textarea id='input' placeholder="Unsanitized HTML">
<!-- This removes both the src and onerror attributes because src is not a valid url. -->
<img src="error" onerror="alert('XSS')">
<div id="something_harmless" onload="alert('More XSS')">
   <b>Bold text!</b> and <em>Italic text!</em>, some more text. <del>Deleted text!</del>
</div>
 <script>
    alert("This would be XSS");
  </script>
</textarea>
<textarea id='output' placeholder="Sanitized HTML will appear here" readonly></textarea>
<script>
  document.querySelector("#input").onkeyup = () => {
    document.querySelector("#output").value = sanitize(document.querySelector("#input").value);
  }
</script>

帅气尐潴 2024-07-16 04:36:19

我建议把框架从你的生活中剔除,从长远来看,这会让你的事情变得更加容易。

cloneNode:克隆节点会复制其所有属性及其值,但不会复制事件侦听器

https://developer.mozilla.org/en/DOM/Node.cloneNode

尽管我已经使用 Treewalkers 一段时间了,但以下内容尚未经过测试,它们是 JavaScript 中最被低估的部分之一。 下面列出了您可以抓取的节点类型,通常我使用 SHOW_ELEMENTSHOW_TEXT

http://www.w3.org /TR/DOM-Level-2-Traversal-Range/traversal.html#Traversal-NodeFilter

function xhtml_cleaner(id)
{
 var e = document.getElementById(id);
 var f = document.createDocumentFragment();
 f.appendChild(e.cloneNode(true));

 var walker = document.createTreeWalker(f,NodeFilter.SHOW_ELEMENT,null,false);

 while (walker.nextNode())
 {
  var c = walker.currentNode;
  if (c.hasAttribute('contentEditable')) {c.removeAttribute('contentEditable');}
  if (c.hasAttribute('style')) {c.removeAttribute('style');}

  if (c.nodeName.toLowerCase()=='script') {element_del(c);}
 }

 alert(new XMLSerializer().serializeToString(f));
 return f;
}


function element_del(element_id)
{
 if (document.getElementById(element_id))
 {
  document.getElementById(element_id).parentNode.removeChild(document.getElementById(element_id));
 }
 else if (element_id)
 {
  element_id.parentNode.removeChild(element_id);
 }
 else
 {
  alert('Error: the object or element \'' + element_id + '\' was not found and therefore could not be deleted.');
 }
}

I recommend cutting frameworks out of your life, it would make things excessively easier for you in the long term.

cloneNode: Cloning a node copies all of its attributes and their values but does NOT copy event listeners.

https://developer.mozilla.org/en/DOM/Node.cloneNode

The following is not tested though I have using treewalkers for some time now and they are one of the most undervalued parts of JavaScript. Here is a list of the node types you can crawl, usually I use SHOW_ELEMENT or SHOW_TEXT.

http://www.w3.org/TR/DOM-Level-2-Traversal-Range/traversal.html#Traversal-NodeFilter

function xhtml_cleaner(id)
{
 var e = document.getElementById(id);
 var f = document.createDocumentFragment();
 f.appendChild(e.cloneNode(true));

 var walker = document.createTreeWalker(f,NodeFilter.SHOW_ELEMENT,null,false);

 while (walker.nextNode())
 {
  var c = walker.currentNode;
  if (c.hasAttribute('contentEditable')) {c.removeAttribute('contentEditable');}
  if (c.hasAttribute('style')) {c.removeAttribute('style');}

  if (c.nodeName.toLowerCase()=='script') {element_del(c);}
 }

 alert(new XMLSerializer().serializeToString(f));
 return f;
}


function element_del(element_id)
{
 if (document.getElementById(element_id))
 {
  document.getElementById(element_id).parentNode.removeChild(document.getElementById(element_id));
 }
 else if (element_id)
 {
  element_id.parentNode.removeChild(element_id);
 }
 else
 {
  alert('Error: the object or element \'' + element_id + '\' was not found and therefore could not be deleted.');
 }
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文