Can I load an entire HTML document into a document fragment in Internet Explorer?

Posted on 2024-12-05 09:11:32

Here's something I've been having a little bit of difficulty with. I have a local client-side script that needs to allow a user to fetch a remote web page and search that resulting page for forms. In order to do this (without regex), I need to parse the document into a fully traversable DOM object.

Some limitations I'd like to stress:

  • I don't want to use libraries (like jQuery). There's too much bloat for what I need to do here.
  • Under no circumstances should scripts from the remote page be executed (for security reasons).
  • DOM APIs, such as getElementsByTagName, need to be available.
  • It only needs to work in Internet Explorer, but in 7 at the very least.
  • Let's pretend I don't have access to a server. I do, but I can't use it for this.

What I've tried

Assuming I have a complete HTML document string (including DOCTYPE declaration) in the variable html, here's what I've tried so far:

var frag = document.createDocumentFragment(),
div  = frag.appendChild(document.createElement("div"));

div.outerHTML = html;
//-> results in an empty fragment

div.insertAdjacentHTML("afterEnd", html);
//-> HTML is not added to the fragment

div.innerHTML = html;
//-> Error (expected, but I tried it anyway)

var doc = new ActiveXObject("htmlfile");
doc.write(html);
doc.close();
//-> JavaScript executes

I've also tried extracting the <head> and <body> nodes from the HTML and adding them to an <html> element inside the fragment, still no luck.

Does anyone have any ideas?

Comments (7)

鹿! 2024-12-12 09:11:32

Fiddle: http://jsfiddle.net/JFSKe/6/

DocumentFragment doesn't implement DOM methods. Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution should be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline-frame.

All external resources and scripts will be disabled. See Explanation of the code for more information.

Code

/*
 @param String html    The string with HTML which has to be converted to a DOM object
 @param func callback  (optional) Callback(HTMLDocument doc, function destroy)
 @returns              undefined if callback exists, else: Object
                        HTMLDocument doc  DOM fetched from Parameter:html
                        function destroy  Removes HTMLDocument doc.         */
function string2dom(html, callback){
    /* Sanitise the string */
    html = sanitiseHTML(html); /*Defined at the bottom of the answer*/

    /* Create an IFrame */
    var iframe = document.createElement("iframe");
    iframe.style.display = "none";
    document.body.appendChild(iframe);

    var doc = iframe.contentDocument || iframe.contentWindow.document;
    doc.open();
    doc.write(html);
    doc.close();

    function destroy(){
        iframe.parentNode.removeChild(iframe);
    }
    if(callback) callback(doc, destroy);
    else return {"doc": doc, "destroy": destroy};
}

/* @name sanitiseHTML
   @param String html  A string representing HTML code
   @return String      A new string, fully stripped of external resources.
                       All "external" attributes (href, src) are prefixed by data- */

function sanitiseHTML(html){
    /* Adds a <!--"'--> before every matched tag, so that unterminated quotes
        don't prevent the browser from splitting a tag. Test case:
       '<input style="foo;b:url(0);><input onclick="<input type=button onclick="too() href=;>">' */
    var prefix = "<!--\"'-->";
    /*Attributes should not be prefixed by these characters. This list is not
     complete, but will be sufficient for this function.
      (see http://www.w3.org/TR/REC-xml/#NT-NameChar) */
    var att = "[^-a-z0-9:._]";
    var tag = "<[a-z]";
    var any = "(?:[^<>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^<>]*";
    var etag = "(?:>|(?=<))";

    /*
      @name ae
      @description          Converts a given string in a sequence of the
                             original input and the HTML entity
      @param String string  String to convert
      */
    var entityEnd = "(?:;|(?!\\d))";
    var ents = {" ":"(?:\\s|&nbsp;?|&#0*32"+entityEnd+"|&#x0*20"+entityEnd+")",
                "(":"(?:\\(|&#0*40"+entityEnd+"|&#x0*28"+entityEnd+")",
                ")":"(?:\\)|&#0*41"+entityEnd+"|&#x0*29"+entityEnd+")",
                ".":"(?:\\.|&#0*46"+entityEnd+"|&#x0*2e"+entityEnd+")"};
                /*Placeholder to avoid tricky filter-circumventing methods*/
    var charMap = {};
    var s = ents[" "]+"*"; /* Short-hand space */
    /* Important: Must be pre- and postfixed by < and >. RE matches a whole tag! */
    function ae(string){
        var all_chars_lowercase = string.toLowerCase();
        if(ents[string]) return ents[string];
        var all_chars_uppercase = string.toUpperCase();
        var RE_res = "";
        for(var i=0; i<string.length; i++){
            var char_lowercase = all_chars_lowercase.charAt(i);
            if(charMap[char_lowercase]){
                RE_res += charMap[char_lowercase];
                continue;
            }
            var char_uppercase = all_chars_uppercase.charAt(i);
            var RE_sub = [char_lowercase];
            RE_sub.push("&#0*" + char_lowercase.charCodeAt(0) + entityEnd);
            RE_sub.push("&#x0*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd);
            if(char_lowercase != char_uppercase){
                RE_sub.push("&#0*" + char_uppercase.charCodeAt(0) + entityEnd);
                RE_sub.push("&#x0*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd);
            }
            RE_sub = "(?:" + RE_sub.join("|") + ")";
            RE_res += (charMap[char_lowercase] = RE_sub);
        }
        return(ents[string] = RE_res);
    }
    /*
      @name by
      @description  second argument for the replace function.
      */
    function by(match, group1, group2){
        /* Adds a data-prefix before every external pointer */
        return group1 + "data-" + group2;
    }
    /*
      @name cr
      @description            Selects a HTML element and performs a
                                  search-and-replace on attributes
      @param String selector  HTML substring to match
      @param String attribute RegExp-escaped; HTML element attribute to match
      @param String marker    Optional RegExp-escaped; marks the prefix
      @param String delimiter Optional RegExp escaped; non-quote delimiters
      @param String end       Optional RegExp-escaped; forces the match to
                                  end before an occurrence of <end> when 
                                  quotes are missing
     */
    function cr(selector, attribute, marker, delimiter, end){
        if(typeof selector == "string") selector = new RegExp(selector, "gi");
        marker = typeof marker == "string" ? marker : "\\s*=";
        delimiter = typeof delimiter == "string" ? delimiter : "";
        end = typeof end == "string" ? end : "";
        var is_end = end && "?";
        var re1 = new RegExp("("+att+")("+attribute+marker+"(?:\\s*\"[^\""+delimiter+"]*\"|\\s*'[^'"+delimiter+"]*'|[^\\s"+delimiter+"]+"+is_end+")"+end+")", "gi");
        html = html.replace(selector, function(match){
            return prefix + match.replace(re1, by);
        });
    }
    /* 
      @name cri
      @description            Selects an attribute of a HTML element, and
                               performs a search-and-replace on certain values
      @param String selector  HTML element to match
      @param String attribute RegExp-escaped; HTML element attribute to match
      @param String front     RegExp-escaped; attribute value, prefix to match
      @param String flags     Optional RegExp flags, default "gi"
      @param String delimiter Optional RegExp-escaped; non-quote delimiters
      @param String end       Optional RegExp-escaped; forces the match to
                                  end before an occurrence of <end> when 
                                  quotes are missing
     */
    function cri(selector, attribute, front, flags, delimiter, end){
        if(typeof selector == "string") selector = new RegExp(selector, "gi");
        flags = typeof flags == "string" ? flags : "gi";
         var re1 = new RegExp("("+att+attribute+"\\s*=)((?:\\s*\"[^\"]*\"|\\s*'[^']*'|[^\\s>]+))", "gi");

        end = typeof end == "string" ? end + ")" : ")";
        var at1 = new RegExp('(")('+front+'[^"]+")', flags);
        var at2 = new RegExp("(')("+front+"[^']+')", flags);
        var at3 = new RegExp("()("+front+'(?:"[^"]+"|\'[^\']+\'|(?:(?!'+delimiter+').)+)'+end, flags);

        var handleAttr = function(match, g1, g2){
            if(g2.charAt(0) == '"') return g1+g2.replace(at1, by);
            if(g2.charAt(0) == "'") return g1+g2.replace(at2, by);
            return g1+g2.replace(at3, by);
        };
        html = html.replace(selector, function(match){
             return prefix + match.replace(re1, handleAttr);
        });
    }

    /* <meta http-equiv=refresh content="  ; url= " > */
    html = html.replace(new RegExp("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+etag+"|'"+ae("refresh")+"'"+any+etag+"|"+ae("refresh")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "gi"), "<!-- meta http-equiv=refresh stripped-->");

    /* Stripping all scripts (the CDATA opener is accepted as <![CDATA[ or <[CDATA[) */
    html = html.replace(new RegExp("<script"+any+">\\s*//\\s*<!?\\[CDATA\\[[\\S\\s]*?]]>\\s*</script[^>]*>", "gi"), "<!--CDATA script-->");
    html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
    cr(tag+any+att+"on[-a-z0-9:_.]+="+any+etag, "on[-a-z0-9:_.]+"); /* Event listeners */

    cr(tag+any+att+"href\\s*="+any+etag, "href"); /* Linked elements */
    cr(tag+any+att+"src\\s*="+any+etag, "src"); /* Embedded elements */

    cr("<object"+any+att+"data\\s*="+any+etag, "data"); /* <object data= > */
    cr("<applet"+any+att+"codebase\\s*="+any+etag, "codebase"); /* <applet codebase= > */

    /* <param name=movie value= >*/
    cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+etag+"|'"+ae("movie")+"'"+any+etag+"|"+ae("movie")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "value");

    /* <style> and < style=  > url()*/
    cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)");
    cri(tag+any+att+"style\\s*="+any+etag, "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")"));

    /* IE7- CSS expression() */
    cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "expression", "\\s*\\(\\s*", "", "\\s*\\)");
    cri(tag+any+att+"style\\s*="+any+etag, "style", ae("expression")+s+ae("(")+s, 0, s+ae(")"), ae(")"));
    return html.replace(new RegExp("(?:"+prefix+")+", "g"), prefix);
}

Explanation of the code

The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function is completely rewritten though, in order to achieve maximum efficiency and reliability.

Additionally, a new set of RegExps has been added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--"'-->. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">.

These RegExps are dynamically created using an internal function cr/cri (Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities aren't breaking a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by function ae (Any Entity).
The actual replacements are done by function by (replace by). In this implementation, by adds data- before all matched attributes.
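
To illustrate the idea behind ae, here is a rough sketch that builds a pattern matching a word whether its characters are written literally or as decimal/hexadecimal numeric character references. The name entityTolerantPattern is hypothetical; this is a simplified stand-in, not the ae function above (it has no cache and assumes the word contains only letters).

```javascript
// Hypothetical helper: a simplified stand-in for the answer's ae().
// Builds a RegExp source that matches `word` when each character is written
// literally or as a numeric character reference (decimal or hexadecimal).
// Assumption: `word` contains only letters, so no RegExp escaping is needed.
function entityTolerantPattern(word) {
    var entityEnd = "(?:;|(?!\\d))";   // optional ";" terminator, as in the answer
    var out = "";
    for (var i = 0; i < word.length; i++) {
        var lower = word.charAt(i).toLowerCase();
        var upper = word.charAt(i).toUpperCase();
        var alternatives = [lower];
        var variants = lower === upper ? [lower] : [lower, upper];
        for (var j = 0; j < variants.length; j++) {
            var code = variants[j].charCodeAt(0);
            alternatives.push("&#0*" + code + entityEnd);               // decimal form
            alternatives.push("&#x0*" + code.toString(16) + entityEnd); // hex form
        }
        out += "(?:" + alternatives.join("|") + ")";
    }
    return out;
}

// The "i" flag takes care of literal upper-case letters.
var re = new RegExp("^" + entityTolerantPattern("refresh") + "$", "i");
```

With this pattern, "refresh", "REFRESH", "&#114;efresh" and "&#x72;efresh" are all recognised as the same keyword.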

  1. All <script>//<[CDATA[ .. //]]></script> occurrences are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it's safe to go to the next replacement:
  2. The remaining <script>...</script> tags are removed.
  3. The <meta http-equiv=refresh .. > tag is removed.
  4. All event listeners and external pointers/attributes (href, src, url()) are prefixed by data-, as described previously.

  5. An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame is made invisible and appended to the document, so that the DOM can be accessed. document.write() is used to write HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.

  6. If a callback function has been specified, the function will be called with two arguments. The first argument is a reference to the generated document object. The second argument is a function, which destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.
    If the callback function isn't specified, the function returns an object consisting of two properties (doc and destroy), which behave the same as the previously mentioned arguments.
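
The reason steps 1 and 2 have to run in that order can be seen in a stripped-down sketch. It assumes scripts wrapped in the standard //<![CDATA[ ... //]]> form and uses a hypothetical function name, stripScripts. The regular expressions are deliberately simpler than the ones in sanitiseHTML, so treat it as an illustration, not a sanitiser.

```javascript
// Hypothetical, simplified illustration of steps 1 and 2 (not the full sanitiser).
function stripScripts(html) {
    // Step 1: scripts whose body is wrapped in a CDATA section. The body may
    // contain the text "</script>", so match through "]]>" before anything else.
    html = html.replace(
        /<script[^>]*>\s*\/\/\s*<!\[CDATA\[[\S\s]*?\]\]>\s*<\/script[^>]*>/gi,
        "<!--CDATA script-->"
    );
    // Step 2: every remaining <script>...</script> pair.
    return html.replace(/<script[\S\s]*?<\/script\s*>/gi, "<!--Non-CDATA script-->");
}

var input = '<p>a</p><script>//<![CDATA[\nvar s = "</script>";\n//]]></script>' +
            '<script src="x.js"></script><p>b</p>';
var output = stripScripts(input);
// output: '<p>a</p><!--CDATA script--><!--Non-CDATA script--><p>b</p>'
```

If step 2 ran first, the match for the first script would end at the "</script>" inside the string literal and leave the tail of the script in the page.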

Additional notes

  • Setting the designMode property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can set designMode = "On" on the frame's document (designMode is a property of the document, not of the iframe element) instead of using the script stripping feature.
  • I wasn't able to find a reliable source for the htmlfile activeXObject. According to this source, htmlfile is slower than IFrames, and more susceptible to memory leaks.
  • All affected attributes (href, src, ...) are prefixed by data-. An example of getting/changing these attributes is shown for data-href:
    elem.getAttribute("data-href") and elem.setAttribute("data-href", "...")
    elem.dataset.href and elem.dataset.href = "...".
  • External resources have been disabled. As a result, the page may look completely different:
    <link rel="stylesheet" href="main.css" /> No external styles
    <script>document.body.bgColor="red";</script> No scripted styles
    <img src="128x128.png" /> No images: the size of the element may be completely different.
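
Because the sanitiser renames attributes instead of deleting them, the original values stay readable. A minimal sketch, assuming a hypothetical helper name originalAttr and relying only on getAttribute (dataset is not available in old IE versions such as IE7):

```javascript
// Hypothetical helper: returns the value an attribute had before sanitising.
// Falls back to the live attribute for names the sanitiser left untouched.
function originalAttr(elem, name) {
    var disabled = elem.getAttribute("data-" + name);
    // "!= null" also covers hosts that return undefined for a missing attribute.
    return disabled != null ? disabled : elem.getAttribute(name);
}
```

For a link taken from the sanitised document, originalAttr(link, "href") would give back the URL that is now stored in data-href.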

Examples

sanitiseHTML(html)
Paste this bookmarklet in the location bar. It will offer an option to inject a textarea, showing the sanitised HTML string.

javascript:void(function(){var s=document.createElement("script");s.src="http://rob.lekensteyn.nl/html-sanitizer.js";document.body.appendChild(s)})();

Code examples - string2dom(html):

string2dom("<html><head><title>Test</title></head></html>", function(doc, destroy){
    alert(doc.title); /* Alert: "Test" */
    destroy();
});

var test = string2dom("<div id='secret'></div>");
alert(test.doc.getElementById("secret").tagName); /* Alert: "DIV" */
test.destroy();
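
For the use case in the question (finding forms in the fetched page), the resulting document can be queried with the usual DOM calls. The sketch below assumes a hypothetical helper named listForms; it takes the document as a parameter, so it should work with test.doc or with the doc passed to the callback.

```javascript
// Hypothetical helper: summarises every <form> in a parsed document.
function listForms(doc) {
    var forms = doc.getElementsByTagName("form");
    var result = [];
    for (var i = 0; i < forms.length; i++) {
        result.push({
            action: forms[i].getAttribute("action") || "",
            method: (forms[i].getAttribute("method") || "get").toLowerCase(),
            fieldCount: forms[i].getElementsByTagName("input").length
        });
    }
    return result;
}

// Intended use together with string2dom from this answer:
//   string2dom(html, function(doc, destroy){
//       var forms = listForms(doc);
//       destroy();
//   });
```

The summary is copied into plain objects, so it remains usable after destroy() has removed the frame.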

那支青花 2024-12-12 09:11:32

Not sure why you're messing with documentFragments; you can just set the HTML text as the innerHTML of a new div element. Then you can use that div element for getElementsByTagName etc. without adding the div to the DOM:

var htmlText= '<html><head><title>Test</title></head><body><div id="test_ele1">this is test_ele1 content</div><div id="test_ele2">this is test_ele content2</div></body></html>';

var d = document.createElement('div');
d.innerHTML = htmlText;

console.log(d.getElementsByTagName('div'));

If you're really married to the idea of a documentFragment, you can use this code, but you'll still have to wrap it in a div to get the DOM functions you're after:

function makeDocumentFragment(htmlText) {
    var range = document.createRange();
    var frag = range.createContextualFragment(htmlText);
    var d = document.createElement('div');
    d.appendChild(frag);
    return d;
}

夜访吸血鬼 2024-12-12 09:11:32

I'm not sure if IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved:

var
      doc = document.implementation.createHTMLDocument("")
    , doc_elt = doc.documentElement
    , first_elt
;
doc_elt.innerHTML = your_html_here;
first_elt = doc_elt.firstElementChild;
if ( // are we dealing with an entire document or a fragment?
       doc_elt.childElementCount === 1
    && first_elt.tagName.toLowerCase() === "html"
) {
    doc.replaceChild(first_elt, doc_elt);
}

// doc is an HTML document
// you can now reference stuff like doc.title, etc.
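
Since support is uncertain, a feature test avoids guessing. A small sketch, with canCreateHTMLDocument as a hypothetical name; it takes the host document as a parameter so that the check does not throw where document.implementation is missing:

```javascript
// Hypothetical helper: reports whether createHTMLDocument can be called.
function canCreateHTMLDocument(hostDoc) {
    var impl = hostDoc && hostDoc.implementation;
    // Compare against "undefined" rather than "function": old IE versions
    // can report host methods as typeof "object".
    return !!(impl && typeof impl.createHTMLDocument !== "undefined");
}
```

Where canCreateHTMLDocument(document) is false, which as far as I know is what IE7 would report, a different approach such as an iframe is needed. The snippet above also uses firstElementChild and childElementCount, which would need the same kind of check.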

请爱~陌生人 2024-12-12 09:11:32

Assuming the HTML is valid XML too, you may use loadXML().

无畏 2024-12-12 09:11:32

DocumentFragment doesn't support getElementsByTagName -- that's only supported by Document.

You may need to use a library like jsdom, which provides an implementation of the DOM and through which you can search using getElementsByTagName and other DOM APIs. And you can set it to not execute scripts. Yes, it's 'heavy' and I don't know if it works in IE 7.
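
If only a tag-name search is needed, the missing method can be approximated by walking childNodes, which a fragment does expose. A minimal sketch with a hypothetical name, collectByTagName; it is not part of any library, and it only helps once the markup has actually been parsed into the fragment:

```javascript
// Hypothetical helper: depth-first search for elements by tag name.
// Works on any node that exposes childNodes, including a DocumentFragment.
function collectByTagName(root, tagName, found) {
    found = found || [];
    var wanted = tagName.toUpperCase();
    var children = root.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        var node = children[i];
        if (node.nodeType === 1) {              // element nodes only
            if (wanted === "*" || node.nodeName.toUpperCase() === wanted) {
                found.push(node);
            }
            collectByTagName(node, tagName, found);
        }
    }
    return found;   // plain array in document order, not a live collection
}
```

Unlike getElementsByTagName, the result is a static array, so it does not update when the tree changes.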

[浮城] 2024-12-12 09:11:32

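Just wandered across this page, so I'm a bit late :) but the following should help anyone with a similar problem in future. That said, IE7/8 should really be ignored by now, and there are much better methods supported by more modern browsers.

The following works across nearly everything I've tested. The only two downsides are:

  1. I've added bespoke getElementById and getElementsByName functions to the root div element, so these won't be available further down the tree (unless the code is modified to cater for this).

  2. The doctype will be ignored. I don't think this makes much difference, though, as in my experience the doctype doesn't affect how the DOM is structured, only how it is rendered (which obviously doesn't happen with this method).

Basically, the system relies on the fact that user agents treat plain tags and namespaced tags (<tag> versus <namespace:tag>) differently. As has been found, certain special tags cannot exist within a div element, so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). Whilst these namespaced tags won't actually behave like the real tags in question, we are only using them for their structural position in the document, so this doesn't really cause a problem.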
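
The renaming step can be shown in isolation. The sketch below applies the same kind of replacement to a source string, using the faux: prefix and the tag list from the code in this answer. namespaceSpecialTags is a hypothetical name, and the pattern is a simplified variant with an added lookahead, not the answer's exact expression.

```javascript
// Hypothetical helper: prefixes the "special" tags so that innerHTML keeps them.
function namespaceSpecialTags(src) {
    var special = ["html", "head", "body", "title"];
    // The lookahead (?=[\s/>]) keeps <header> from being mistaken for <head>.
    var re = new RegExp("<(/?)(" + special.join("|") + ")(?=[\\s/>])([^>]*)>", "gi");
    return src.replace(re, "<$1faux:$2$3>");
}

var out = namespaceSpecialTags(
    '<html><head><title>T</title></head><body class="x"><header>h</header></body></html>'
);
// out: '<faux:html><faux:head><faux:title>T</faux:title></faux:head>' +
//      '<faux:body class="x"><header>h</header></faux:body></faux:html>'
```

Attributes such as class="x" survive the rename, so the structure and attribute data of the original document stay searchable.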

The markup and code are as follows:

<!DOCTYPE html>
<html>
<head>
<script>

  /// function for parsing HTML source to a dom structure
  /// Tested in Mac OSX, Win 7, Win XP with FF, IE 7/8/9, 
  /// Chrome, Safari & Opera.
  function parseHTML(src){

    /// create a random div, this will be our root
    var div = document.createElement('div'),
        /// specify our namespace prefix
        ns = 'faux:',
        /// state which tags we will treat as "special"
        stn = ['html','head','body','title'],
        /// the reg exp for replacing the special tags
        re = new RegExp('<(/?)('+stn.join('|')+')([^>]*)?>','gi'),
        /// remember the getElementsByTagName function before we override it
        gtn = div.getElementsByTagName;

    /// a quick function to namespace certain tag names
    var nspace = function(tn){
      if ( stn.indexOf ) {
        return stn.indexOf(tn) != -1 ? ns + tn : tn;
      }
      else {
        return ('|'+stn.join('|')+'|').indexOf('|'+tn+'|') != -1 ? ns + tn : tn;
      }
    };

    /// search and replace our source so that special tags are namespaced
    ///   required for IE7/8 to render tags before first text found
    /// <faux:check /> tag added so we can test how namespaces work
    src = ' <'+ns+'check />' + src.replace(re,'<$1'+ns+'$2$3>');
    /// inject to the div
    div.innerHTML = src;
    /// quick test to see how we support namespaces in TagName searches
    if ( !div.getElementsByTagName(ns+'check').length ) {
      ns = '';
    }

    /// create our replacement getByName and getById functions
    var createGetElementByAttr = function(attr, collect){
      var func = function(a,w){
        var i,c,e,f,l,o; w = w||[];
        if ( this.nodeType == 1 ) {
          if ( this.getAttribute(attr) == a ) {
            if ( collect ) {
              w.push(this);
            }
            else {
              return this;
            }
          }
        }
        else {
          return false;
        }
        if ( (c = this.childNodes) && (l = c.length) ) {
          for( i=0; i<l; i++ ){
            if( (e = c[i]) && (e.nodeType == 1) ) {
              if ( (f = func.call( e, a, w )) && !collect ) {
                return f;
              }
            }
          }
        }
        return (w.length?w:false);
      }
      return func;
    }

    /// apply these replacement functions to the div container, obviously 
    /// you could add these to prototypes for browsers the support element 
    /// constructors. For other browsers you could step each element and 
    /// apply the functions through-out the node tree... however this would  
    /// be quite messy, far better just to always call from the root node - 
    /// or use div.getElementsByTagName.call( localElement, 'tag' );
    div.getElementsByTagName = function(t){return gtn.call(this,nspace(t));}
    div.getElementsByName    = createGetElementByAttr('name', true);
    div.getElementById       = createGetElementByAttr('id', false);

    /// return the final element
    return div;
  }

  window.onload = function(){

    /// parse the HTML source into a node tree
    var dom = parseHTML( document.getElementById('source').innerHTML );

    /// test some look ups :)
    var a = dom.getElementsByTagName('head'),
        b = dom.getElementsByTagName('title'),
        c = dom.getElementsByTagName('script'),
        d = dom.getElementById('body');

    /// alert the result
    alert(a[0].innerHTML);
    alert(b[0].innerHTML);
    alert(c[0].innerHTML);
    alert(d.innerHTML);

  }
</script>
</head>
<body>
  <xmp id="source">
    <!DOCTYPE html>
    <html>
    <head>
      <!-- Comment //-->
      <meta charset="utf-8">
      <meta name="robots" content="index, follow">
      <title>An example</title>
      <link href="test.css" />
      <script>alert('of parsing..');</script>
    </head>
    <body id="body">
      <b>in a similar way to createDocumentFragment</b>
    </body>
    </html>
  </xmp>
</body>
</html>

Just wandered across this page and am a bit late to be of any use :) but the following should help anyone with a similar problem in future. That said, IE7/8 should really be ignored by now, and the more modern browsers support much better methods.

The following works across nearly everything I've tested. The only two downsides are:

  1. I've added bespoke getElementById and getElementsByName functions to the root div element only, so they won't be available further down the tree (unless the code is modified to cater for this).

  2. The doctype will be ignored. I don't think this makes much difference, though, as in my experience the doctype won't affect how the DOM is structured, just how it is rendered (which obviously won't happen with this method).

Basically the system relies on the fact that <tag> and <namespace:tag> are treated differently by user agents. As has been found, certain special tags cannot exist within a div element, so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). Whilst these namespaced tags won't actually behave like the real tags in question, we are only really using them for their structural position in the document, so that doesn't cause a problem.
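
To make the renaming step concrete, here is a small stand-alone sketch of just the string rewrite. It is an illustration rather than the full parser: the names specialTags, specialTagPattern and namespaceSpecialTags are invented here, and the pattern includes a \b word boundary so that a tag like <header> is not mistaken for <head>.

```javascript
// Illustration only: the tag list and the 'faux:' prefix follow the
// approach described above; the helper names are hypothetical.
var specialTags = ['html', 'head', 'body', 'title'];

// Matches opening and closing forms of the special tags. The \b keeps
// longer tag names that merely start with one of them from matching.
var specialTagPattern = new RegExp(
  '<(/?)(' + specialTags.join('|') + ')\\b([^>]*)>', 'gi');

function namespaceSpecialTags(src) {
  // $1 = optional slash, $2 = tag name, $3 = attributes (if any)
  return src.replace(specialTagPattern, '<$1faux:$2$3>');
}

var sample = '<html><head><title>Hi</title></head>' +
             '<body id="b"><header>x</header></body></html>';
var rewritten = namespaceSpecialTags(sample);
// rewritten is now:
// <faux:html><faux:head><faux:title>Hi</faux:title></faux:head>
// <faux:body id="b"><header>x</header></faux:body></faux:html>
// (shown wrapped here; the actual string has no line breaks)
```

Because the renamed tags are no longer "special" to the HTML parser, they survive being assigned to a div's innerHTML.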

Markup and code are as follows:

<!DOCTYPE html>
<html>
<head>
<script>

  /// function for parsing HTML source to a dom structure
  /// Tested in Mac OSX, Win 7, Win XP with FF, IE 7/8/9, 
  /// Chrome, Safari & Opera.
  function parseHTML(src){

    /// create a random div, this will be our root
    var div = document.createElement('div'),
        /// specify our namespace prefix
        ns = 'faux:',
        /// state which tags we will treat as "special"
        stn = ['html','head','body','title'],
        /// the reg exp for replacing the special tags; the \b stops
        /// longer names such as <header> from matching "head"
        re = new RegExp('<(/?)('+stn.join('|')+')\\b([^>]*)>','gi'),
        /// remember the getElementsByTagName function before we override it
        gtn = div.getElementsByTagName;

    /// a quick function to namespace certain tag names
    var nspace = function(tn){
      if ( stn.indexOf ) {
        return stn.indexOf(tn) != -1 ? ns + tn : tn;
      }
      else {
        /// wrap tn in delimiters so only whole names match (e.g. not 'a' in 'head')
        return ('|'+stn.join('|')+'|').indexOf('|'+tn+'|') != -1 ? ns + tn : tn;
      }
    };

    /// search and replace our source so that special tags are namespaced
    /// the leading space is required for IE7/8 to render tags that come
    ///   before the first text found
    /// <faux:check /> tag added so we can test how namespaces work
    src = ' <'+ns+'check />' + src.replace(re,'<$1'+ns+'$2$3>');
    /// inject to the div
    div.innerHTML = src;
    /// quick test to see how we support namespaces in TagName searches
    if ( !div.getElementsByTagName(ns+'check').length ) {
      ns = '';
    }

    /// create our replacement getByName and getById functions
    var createGetElementByAttr = function(attr, collect){
      var func = function(a,w){
        var i,c,e,f,l,o; w = w||[];
        if ( this.nodeType == 1 ) {
          if ( this.getAttribute(attr) == a ) {
            if ( collect ) {
              w.push(this);
            }
            else {
              return this;
            }
          }
        }
        else {
          return false;
        }
        if ( (c = this.childNodes) && (l = c.length) ) {
          for( i=0; i<l; i++ ){
            if( (e = c[i]) && (e.nodeType == 1) ) {
              if ( (f = func.call( e, a, w )) && !collect ) {
                return f;
              }
            }
          }
        }
        return (w.length?w:false);
      }
      return func;
    }

    /// apply these replacement functions to the div container, obviously 
    /// you could add these to prototypes for browsers that support element 
    /// constructors. For other browsers you could step each element and 
    /// apply the functions throughout the node tree... however this would  
    /// be quite messy, far better just to always call from the root node - 
    /// or use div.getElementsByTagName.call( localElement, 'tag' );
    div.getElementsByTagName = function(t){return gtn.call(this,nspace(t));}
    div.getElementsByName    = createGetElementByAttr('name', true);
    div.getElementById       = createGetElementByAttr('id', false);

    /// return the final element
    return div;
  }

  window.onload = function(){

    /// parse the HTML source into a node tree
    var dom = parseHTML( document.getElementById('source').innerHTML );

    /// test some look ups :)
    var a = dom.getElementsByTagName('head'),
        b = dom.getElementsByTagName('title'),
        c = dom.getElementsByTagName('script'),
        d = dom.getElementById('body');

    /// alert the result
    alert(a[0].innerHTML);
    alert(b[0].innerHTML);
    alert(c[0].innerHTML);
    alert(d.innerHTML);

  }
</script>
</head>
<body>
  <xmp id="source">
    <!DOCTYPE html>
    <html>
    <head>
      <!-- Comment //-->
      <meta charset="utf-8">
      <meta name="robots" content="index, follow">
      <title>An example</title>
      <link href="test.css" />
      <script>alert('of parsing..');</script>
    </head>
    <body id="body">
      <b>in a similar way to createDocumentFragment</b>
    </body>
    </html>
  </xmp>
</body>
</html>
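
Once the source is parsed, the original goal (finding the forms in the fetched page) comes down to an ordinary tag-name lookup on the returned root. A rough sketch, under the assumption that the root exposes getElementsByTagName and its elements expose getAttribute; listForms is a hypothetical helper, not part of the code above.

```javascript
// Hypothetical helper: summarises every <form> found under `root`.
// `root` is assumed to be the element returned by parseHTML(), but any
// object with a DOM-style getElementsByTagName() will do.
function listForms(root) {
  var forms = root.getElementsByTagName('form');
  var result = [];
  var i, form;
  for (i = 0; i < forms.length; i++) {
    form = forms[i];
    result.push({
      action: form.getAttribute('action') || '',
      // HTML treats a missing method attribute as GET
      method: (form.getAttribute('method') || 'get').toLowerCase(),
      // number of <input> elements inside this form
      fieldCount: form.getElementsByTagName('input').length
    });
  }
  return result;
}
```

With the parser above this would be called as listForms(parseHTML(html)), where html is the fetched page source.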
下雨或天晴 2024-12-12 09:11:32

To get full HTML DOM abilities without triggering requests and without having to deal with incompatibilities:

var doc = document.cloneNode(false); // shallow clone: the children are not copied
if (!doc.documentElement) {
    doc.appendChild(doc.createElement('html'));
    doc.documentElement.appendChild(doc.createElement('head'));
    doc.documentElement.appendChild(doc.createElement('body'));
}

All set! doc is an HTML document, but it is not live (it is not attached to the page).
