Strict HTML parsing in JavaScript

Posted 2025-01-06 18:10:07

On Google Chrome (Canary), it seems no string can make the DOM parser fail. I'm trying to parse some HTML, but if the HTML isn't completely, 100%, valid, I want it to display an error. I've tried the obvious:

var newElement = document.createElement('div');
newElement.innerHTML = someMarkup; // Might fail on IE, never on Chrome.

I've also tried the method in this question. Doesn't fail for invalid markup, even the most invalid markup I can produce.

So, is there some way to parse HTML "strictly" in Google Chrome at least? I don't want to resort to tokenizing it myself or using an external validation utility. If there's no other alternative, a strict XML parser is fine, but certain elements don't require closing tags in HTML, and preferably those shouldn't fail.

祁梦 2025-01-13 18:10:07

Use the DOMParser to check a document in two steps:

1. Validate whether the document is XML-conforming by parsing it as XML.
2. Parse the string as HTML. This requires a modification to the DOMParser in browsers that do not support text/html natively.
   Loop through each element and check whether it is an instance of HTMLUnknownElement. For this purpose, getElementsByTagName('*') fits well.
   (If you want to parse the document strictly, you have to loop through each element recursively and track whether the element is allowed at that location, e.g. <area> inside <map>.)
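The recursive placement check mentioned in the parenthetical could look roughly like the sketch below. The rule table and the names `REQUIRED_ANCESTOR`, `findMisplaced` and `el` are hypothetical and deliberately incomplete; the rule is also simplified to "one of these tags must appear somewhere among the ancestors", which is looser than the real HTML content model.

```javascript
// Hypothetical, incomplete rule table: tag -> ancestors, one of which must be present.
const REQUIRED_ANCESTOR = new Map([
  ['area', ['map']],
  ['li', ['ul', 'ol', 'menu']],
  ['td', ['tr']],
  ['th', ['tr']],
]);

// Walks any tree whose nodes expose `localName` and `children`
// (DOM Elements do, plain objects work too) and collects the
// names of elements that lack a required ancestor.
function findMisplaced(node, ancestors = [], problems = []) {
  const required = REQUIRED_ANCESTOR.get(node.localName);
  if (required && !required.some((tag) => ancestors.includes(tag))) {
    problems.push(node.localName);
  }
  const next = ancestors.concat(node.localName);
  for (const child of Array.from(node.children || [])) {
    findMisplaced(child, next, problems);
  }
  return problems;
}

// Tiny helper to build a test tree without a DOM (assumption for illustration only).
const el = (localName, ...children) => ({ localName, children });

// The first <area> sits inside <map>, the second does not.
console.log(findMisplaced(el('div', el('map', el('area')), el('area')))); // → ['area']
```

In a browser, the same function should accept `doc.documentElement` from a parsed document, since DOM elements provide both `localName` and `children`.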

Demo: http://jsfiddle.net/q66Ep/1/

/* DOM parser for text/html, see https://stackoverflow.com/a/9251106/938089 */
;(function(DOMParser) {
    "use strict";
    var DOMParser_proto = DOMParser.prototype,
        real_parseFromString = DOMParser_proto.parseFromString;

    // Skip the patch if text/html is already supported natively.
    try {
        if ((new DOMParser).parseFromString("", "text/html")) return;
    } catch (e) {}

    DOMParser_proto.parseFromString = function(markup, type) {
        if (/^\s*text\/html\s*(;|$)/i.test(type)) {
            var doc = document.implementation.createHTMLDocument(""),
                doc_elt = doc.documentElement,
                first_elt;
            doc_elt.innerHTML = markup;
            first_elt = doc_elt.firstElementChild;
            // If the markup supplied its own <html> root, use it as the document element.
            if (doc_elt.childElementCount === 1 &&
                first_elt.localName.toLowerCase() === "html")
                doc.replaceChild(first_elt, doc_elt);
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));

/*
 * @description              Validate an HTML string
 * @param       String html  The HTML string to be validated
 * @returns            null  If the string is not well-formed XML
 *                    false  If the string contains an unknown element
 *                     true  If the string satisfies both conditions
 */
function validateHTML(html) {
    var parser = new DOMParser()
      , d = parser.parseFromString('<?xml version="1.0"?>'+html,'text/xml')
      , allnodes;
    if (d.querySelector('parsererror')) {
        console.log('Not well-formed HTML (XML)!');
        return null;
    } else {
        /* To use text/html, see https://stackoverflow.com/a/9251106/938089 */
        d = parser.parseFromString(html, 'text/html');
        allnodes = d.getElementsByTagName('*');
        for (var i=allnodes.length-1; i>=0; i--) {
            if (allnodes[i] instanceof HTMLUnknownElement) return false;
        }
    }
    return true; /* The document is syntactically correct, all tags are closed */
}

console.log(validateHTML('<div>'));  //  null, because of the missing close tag
console.log(validateHTML('<x></x>'));// false, because it's not an HTML element
console.log(validateHTML('<a></a>'));//  true, because the tag is closed,
                                     //       and the element is an HTML element
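Because validateHTML returns three distinct values, a plain truthiness test such as `if (!validateHTML(markup))` would lump null and false together. A small helper can keep the cases apart; `describeValidation` and its messages are hypothetical and not part of the answer's code.

```javascript
// Hypothetical helper: maps validateHTML's null / false / true to a message.
// Strict comparisons keep null (XML failure) and false (unknown element) apart.
function describeValidation(result) {
  if (result === null) {
    return 'Markup is not well-formed XML (for example, an unclosed tag).';
  }
  if (result === false) {
    return 'Markup contains an element the browser does not recognise.';
  }
  if (result === true) {
    return 'Markup passed both checks.';
  }
  throw new TypeError('Expected null, false or true, got: ' + String(result));
}
```

In a page that defines validateHTML, this would be called as `describeValidation(validateHTML(markup))`.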

See revision 1 of this answer for an alternative to XML validation without the DOMParser.

Considerations

The current method completely ignores the doctype during validation.
This method returns null for <input type="text">, even though that is valid HTML5, because the tag is not closed and the XML check therefore fails.
Conformance is not checked.
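One possible workaround for the <input type="text"> case is to self-close void elements before the XML check. The sketch below is a naive regex rewrite, not a parser: `selfCloseVoidElements` is a hypothetical name, the tag list is the set of void elements from the HTML Standard, and the rewrite will also touch matches inside comments or scripts. Boolean attributes (`<input disabled>`) and unquoted attribute values would still fail as XML.

```javascript
// Void elements per the HTML Standard; they never take a closing tag.
const VOID_ELEMENTS = [
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
];

// Matches an opening tag of a void element, with or without a trailing slash.
// The lookahead stops names such as <colgroup> or <input-x> from matching.
const VOID_TAG = new RegExp(
  '<(' + VOID_ELEMENTS.join('|') + ')(?=[\\s/>])([^<>]*?)\\s*/?>',
  'gi'
);

// Rewrites <br>, <br/> and <br /> alike to the XML-friendly form <br />.
function selfCloseVoidElements(html) {
  return html.replace(VOID_TAG, '<$1$2 />');
}

console.log(selfCloseVoidElements('<input type="text">')); // → <input type="text" />
```

The rewritten string would be passed to the text/xml step only; the text/html step can keep using the original markup.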
