How to get the entire document HTML as a string?

Posted on 2024-07-18 21:52:49

Is there a way in JS to get the entire HTML within the html tags, as a string?

document.documentElement.??

Comments (17)

べ繥欢鉨o。 2024-07-25 21:52:50

Use document.documentElement.

Same Question answered here:
https://stackoverflow.com/a/7289396/2164160

鸠魁 2024-07-25 21:52:50

The correct way is actually:

webBrowser1.DocumentText

深陷 2024-07-25 21:52:50

You have to iterate through the document's childNodes and get the outerHTML of each node.

In VBA it looks like this:

For Each e In document.ChildNodes
    Put ff, , e.outerHTML & vbCrLf
Next e

This gets you all elements of the web page, including the <!DOCTYPE> node if it exists.

红衣飘飘貌似仙 2024-07-25 21:52:50

I only needed the HTML5 doctype, and the solution had to work in IE11, Edge and Chrome. The code below works fine for me.

function downloadPage(element, event) {
    var isChrome = /Chrome/.test(navigator.userAgent) && /Google Inc/.test(navigator.vendor);

    if ((navigator.userAgent.indexOf("MSIE") != -1) || (!!document.documentMode == true)) {
        document.execCommand('SaveAs', '1', 'page.html');
        event.preventDefault();
    } else {
        if(isChrome) {
            element.setAttribute('href','data:text/html;charset=UTF-8,'+encodeURIComponent('<!doctype html>' + document.documentElement.outerHTML));
        }
        element.setAttribute('download', 'page.html');
    }
}

Then use it in your anchor tag like this:

<a href="#" onclick="downloadPage(this,event);" download>Download entire page.</a>

Example (page content only; the script is the downloadPage function shown above):

I just need doctype html and should work fine in IE11, Edge and Chrome. 

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

<p>
<a href="#" onclick="downloadPage(this,event);"  download><h2>Download entire page.</h2></a></p>

<p>Some image here</p>

<p><img src="https://placeimg.com/250/150/animals"/></p>
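
The Chrome branch above builds a data: URI from the serialized page. As a minimal sketch of just that step, the helper below (buildHtmlDataUri is a hypothetical name, not part of the answer) takes the outerHTML string as a parameter so it can be tried outside a browser. It assumes the HTML5 doctype is the one you want, as the answer does.

```javascript
// Hypothetical helper: wraps serialized HTML in a data: URI, mirroring the
// string concatenation used in the Chrome branch of downloadPage above.
function buildHtmlDataUri(outerHtml) {
    // encodeURIComponent escapes '<', '>', '/', spaces, etc. so the markup
    // survives being placed in an href attribute.
    return 'data:text/html;charset=UTF-8,' +
        encodeURIComponent('<!doctype html>' + outerHtml);
}

// Stand-in string; in a browser you would pass document.documentElement.outerHTML.
const sampleUri = buildHtmlDataUri('<html></html>');
console.log(sampleUri);
```

In a page, the result would be assigned with element.setAttribute('href', ...) exactly as in the answer; very large pages may exceed the URL length a browser accepts, which this sketch does not address.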

会傲 2024-07-25 21:52:49

You can do

new XMLSerializer().serializeToString(document)

in browsers newer than IE 9

See https://caniuse.com/xml-serializer

倒带 2024-07-25 21:52:49

I tried the various answers to see what is returned. I'm using the latest version of Chrome.

The suggestion document.documentElement.innerHTML; returned <head> ... </body>

Gaby's suggestion document.getElementsByTagName('html')[0].innerHTML; returned the same.

The suggestion document.documentElement.outerHTML; returned <html><head> ... </body></html>
which is everything apart from the 'doctype'.

You can retrieve the doctype object with document.doctype. This returns an object, not a string, so if you need to extract the details as strings for all doctypes up to and including HTML5, that is described in the question "Get DocType of an HTML as string with Javascript".

I only wanted HTML5, so the following was enough for me to create the whole document:

alert('<!DOCTYPE HTML>' + '\n' + document.documentElement.outerHTML);
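
A slightly more general variant of the one-liner above, as a sketch: instead of hard-coding the doctype, read its name from document.doctype and skip it when the page has none. The function name documentToHtmlString and the fakeDocument stand-in are assumptions made for illustration; the function deliberately ignores publicId and systemId, so it is only adequate for HTML5-style doctypes.

```javascript
// Sketch: rebuild "doctype + <html>...</html>" from a document-like object.
// `doc` only needs `doctype` (null or an object with `name`) and
// `documentElement.outerHTML`, so a plain object can stand in for `document`.
function documentToHtmlString(doc) {
    // document.doctype is null when the page has no doctype declaration.
    const prefix = doc.doctype ? `<!DOCTYPE ${doc.doctype.name}>\n` : '';
    return prefix + doc.documentElement.outerHTML;
}

// Hypothetical stand-in; in a browser, pass the real `document` instead.
const fakeDocument = {
    doctype: { name: 'html' },
    documentElement: { outerHTML: '<html><head></head><body>Hi</body></html>' },
};
const fullHtml = documentToHtmlString(fakeDocument);
console.log(fullHtml);
```

Taking the document as a parameter rather than reading the global keeps the function usable for iframe documents as well.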

怎言笑 2024-07-25 21:52:49

Get the root <html> element with document.documentElement then get its .innerHTML:

const txt = document.documentElement.innerHTML;
alert(txt);

or its .outerHTML to get the <html> tag as well

const txt = document.documentElement.outerHTML;
alert(txt);
我偏爱纯白色 2024-07-25 21:52:49

I believe document.documentElement.outerHTML should return that for you.

According to MDN, outerHTML is supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile. outerHTML is in the DOM Parsing and Serialization specification.

The MSDN page on the outerHTML property notes that it is supported in IE 5+. Colin's answer links to the W3C quirksmode page, which offers a good comparison of cross-browser compatibility (for other DOM features too).

提赋 2024-07-25 21:52:49

You can also do:

document.getElementsByTagName('html')[0].innerHTML

You will not get the Doctype or html tag, but everything else...

獨角戲 2024-07-25 21:52:49
document.documentElement.innerHTML
高冷爸爸 2024-07-25 21:52:49

To also get things outside the <html>...</html>, most importantly the <!DOCTYPE ...> declaration, you could walk through document.childNodes, turning each into a string:

const html = [...document.childNodes]
    .map(node => nodeToString(node))
    .join('\n') // could use '' instead, but whitespace should not matter.

function nodeToString(node) {
    switch (node.nodeType) {
        case node.ELEMENT_NODE:
            return node.outerHTML
        case node.TEXT_NODE:
            // Text nodes should probably never be encountered, but handling them anyway.
            return node.textContent
        case node.COMMENT_NODE:
            return `<!--${node.textContent}-->`
        case node.DOCUMENT_TYPE_NODE:
            return doctypeToString(node)
        default:
            throw new TypeError(`Unexpected node type: ${node.nodeType}`)
    }
}

I published this code as document-outerhtml on npm.


Edit: Note that the code above depends on a function doctypeToString; its implementation could be as follows (the code below is published on npm as doctype-to-string):

function doctypeToString(doctype) {
    if (doctype === null) {
        return ''
    }
    // Checking with instanceof DocumentType might be neater, but how to get a
    // reference to DocumentType without assuming it to be available globally?
    // To play nice with custom DOM implementations, we resort to duck-typing.
    if (!doctype
        || doctype.nodeType !== doctype.DOCUMENT_TYPE_NODE
        || typeof doctype.name !== 'string'
        || typeof doctype.publicId !== 'string'
        || typeof doctype.systemId !== 'string'
    ) {
        throw new TypeError('Expected a DocumentType')
    }
    const doctypeString = `<!DOCTYPE ${doctype.name}`
        + (doctype.publicId ? ` PUBLIC "${doctype.publicId}"` : '')
        + (doctype.systemId
            ? (doctype.publicId ? `` : ` SYSTEM`) + ` "${doctype.systemId}"`
            : ``)
        + `>`
    return doctypeString
}
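
To make the three output shapes of that doctype logic concrete, here is a condensed restatement run against plain-object stand-ins. The name formatDoctype and the sample objects are assumptions for illustration only; the validation and null handling of doctypeToString are left out.

```javascript
// Condensed restatement of the doctype formatting above (no validation).
function formatDoctype({ name, publicId, systemId }) {
    let out = `<!DOCTYPE ${name}`;
    if (publicId) out += ` PUBLIC "${publicId}"`;
    // The SYSTEM keyword is only emitted when there is no PUBLIC identifier.
    if (systemId) out += (publicId ? '' : ' SYSTEM') + ` "${systemId}"`;
    return out + '>';
}

// Plain objects standing in for DocumentType nodes.
const samples = [
    { name: 'html', publicId: '', systemId: '' },
    {
        name: 'html',
        publicId: '-//W3C//DTD XHTML 1.0 Strict//EN',
        systemId: 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd',
    },
    { name: 'html', publicId: '', systemId: 'about:legacy-compat' },
];
const formatted = samples.map(s => formatDoctype(s));
formatted.forEach(line => console.log(line));
```

The first sample is the HTML5 form, the second carries both identifiers, and the third has a system identifier only.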

混吃等死 2024-07-25 21:52:49

Probably IE only:

    webBrowser1.DocumentText

For Firefox (FF) 1.0 and later:

//serialize current DOM-Tree incl. changes/edits to ss-variable
var ns = new XMLSerializer();
var ss= ns.serializeToString(document);
alert(ss.substr(0,300));

This may work in FF. (It shows the first 300 characters from the very beginning of the source text, mostly the doctype definitions.)

But be aware that FF's normal "Save As" dialog might not save the current state of the page, but rather the originally loaded X/h/tml source text!
(POSTing ss to a temporary file and redirecting to it might deliver a saveable source text that includes the changes/edits made beforehand.)

That said, FF is surprisingly good at restoring state on "back", and "Save (as) ..." nicely includes the states/values of input-like fields, textareas, etc., though not of elements in contenteditable/designMode.

If the document is not an XHTML or XML file (judged by MIME type, not just filename extension!), you can use document.open/write/close to set the appropriate content in the source layer, which will then be saved through the user's save dialog from FF's File/Save menu.
See:
http://www.w3.org/MarkUp/2004/xhtml-faq#docwrite and
https://developer.mozilla.org/en-US/docs/Web/API/document.write

Independently of X(HT)ML questions, try "view-source:http://..." as the value of the src attribute of a (script-created?) iframe, then access the iframe's document in FF via:

<iframe-elementnode>.contentDocument (search for "mdn contentDocument" for the appropriate members, such as 'textContent').
I worked that out years ago and would rather not dig for it again. If you still need it urgently, mention it and I will dive in ...

客…行舟 2024-07-25 21:52:49
document.documentElement.outerHTML
护你周全 2024-07-25 21:52:49

I am using outerHTML for elements (the main <html> container), and XMLSerializer for anything else including <!DOCTYPE>, random comments outside the <html> container, or whatever else might be there. It seems that whitespace isn't preserved outside the <html> element, so I'm adding newlines by default with sep="\n".

function get_document_html(sep="\n") {
    let html = "";
    let xml = new XMLSerializer();
    for (let n of document.childNodes) {
        if (n.nodeType == Node.ELEMENT_NODE)
            html += n.outerHTML + sep;
        else
            html += xml.serializeToString(n) + sep;
    }
    return html;
}

console.log(get_document_html().slice(0, 200));

日暮斜阳 2024-07-25 21:52:49

I always use

document.getElementsByTagName('html')[0].innerHTML

Probably not the right way but I can understand it when I see it.

哑剧 2024-07-25 21:52:49

Using querySelector

const html = document.querySelector("html").outerHTML;
console.log(html)

神回复 2024-07-25 21:52:49

This works if you want everything except the DOCTYPE:

document.getElementsByTagName('html')[0].outerHTML;

or this if you want the doctype too:

new XMLSerializer().serializeToString(document.doctype) + document.getElementsByTagName('html')[0].outerHTML;