Loading an HTML string into a DOM tree with JavaScript
I'm currently working with an automation framework that is pulling a webpage down for analysis, which is then presented as a string for processing. The Rhino Javascript engine is available to assist in parsing the returned web page.
It seems that if the string (which is a complete webpage) can be loaded in a DOM representation, it would provide a very nice interface for parsing and analyzing content.
Using only Javascript, is this a possible and/or feasible concept?
Edit:
I'll break the question down for clarity: say I have a string in JavaScript that contains HTML, like this:
var $mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';
Is it possible, and realistic, to load it into a DOM object somehow?
Comments (3)
I'm accepting JonDavidJohn's answer because it was useful in solving my problem, but I thought I'd include this additional answer for others who may view this question in the future.
It appears that while JavaScript in a browser allows HTML strings to be loaded into a DOM element, the DOM is not part of core ECMAScript, and as such is not available to scripts running under Rhino.
As a side note worth mentioning, a good alternative that was implemented in Rhino 1.6 is E4X. While not a DOM implementation, it does provide for conceptually similar capabilities.
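As a rough sketch of the E4X route: the helper names below (stripProlog, toE4X) are made up for illustration, and the code assumes the page is well-formed XHTML. E4X implementations have been reported to reject a leading XML declaration, so the sketch strips that and a simple DOCTYPE before handing the string to the E4X XML constructor. Outside an E4X-capable engine such as Rhino, toE4X just returns null.

```javascript
// Hypothetical helper: remove a leading XML declaration and a simple
// DOCTYPE (one without an internal subset) from an XHTML string.
function stripProlog(xhtml) {
  return xhtml
    .replace(/^\s*<\?xml[^>]*\?>/, '')
    .replace(/^\s*<!DOCTYPE[^>]*>/i, '');
}

// Hypothetical helper: build an E4X XML object when the engine provides
// the E4X `XML` constructor (Rhino 1.6+); otherwise return null.
function toE4X(xhtml) {
  if (typeof XML === 'undefined') {
    return null; // E4X is not available in this engine
  }
  return new XML(stripProlog(xhtml));
}
```

Under Rhino, something like toE4X($mywebpage) would give an XML object to query, assuming the markup really is well-formed. If the document declares the XHTML namespace, element lookups need namespace-qualified names.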
If the document is XHTML, you can parse it with any XML parser. E4X would probably do the job nicely, as would the built-in Java XML parsing interfaces.
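For the built-in Java route, a Rhino script can reach the JDK's XML parser through the Packages object. This is only a sketch under the assumption that the string is well-formed XHTML; the function name parseXhtmlWithJava is invented here, and outside Rhino it returns null.

```javascript
// Hypothetical helper: parse an XHTML string into an org.w3c.dom.Document
// using the JDK's DocumentBuilder, reached through Rhino's `Packages` bridge.
function parseXhtmlWithJava(xhtml) {
  if (typeof Packages === 'undefined') {
    return null; // not running under Rhino, so no Java classes are reachable
  }
  var factory = Packages.javax.xml.parsers.DocumentBuilderFactory.newInstance();
  var builder = factory.newDocumentBuilder();
  var reader = new Packages.java.io.StringReader(xhtml);
  var source = new Packages.org.xml.sax.InputSource(reader);
  return builder.parse(source); // throws a SAXException on malformed markup
}
```

The result is a regular Java DOM document, so calls such as getElementsByTagName('a').getLength() should work on it. A DOCTYPE that points at an external DTD may cause the parser to try to fetch that DTD, which can be slow.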
The env.js library is designed to emulate the browser environment under Rhino, but I believe your document also needs to be compliant XHTML:
http://ejohn.org/blog/bringing-the-browser-to-the-server/
http://www.envjs.com/
If it's HTML, however, it's more difficult, as browsers are designed to be extremely lenient in how markup is parsed. See here for a list of HTML parsers in Java:
http://java-source.net/open-source/html-parsers
This is not an easy problem to solve. People have gone so far as to embed the Mozilla Gecko engine in Java via JNI in order to use its parsing capabilities.
I would recommend you look into the following pure-Java project:
http://lobobrowser.org/cobra.jsp
The goal of the Lobo project is to develop a pure-Java web browser. It's a pretty interesting project, and there's a lot there, but I believe you could use the parser standalone quite easily in your own application, as described in the following link:
http://lobobrowser.org/cobra/java-html-parser.jsp
If you have a variable that contains HTML and you are running in a browser, you can load it into a DOM object, for example by looking up an element by its id and assigning the string to it.
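A minimal sketch of that idea follows, assuming a browser-style document object. The function name loadHtmlIntoElement and the id 'sandbox' are made up for illustration. The document is passed in as a parameter so the function does not depend on a global, which Rhino would not provide.

```javascript
// Hypothetical helper: put an HTML string into the element with the given id.
// `doc` is expected to behave like a browser `document`.
function loadHtmlIntoElement(doc, id, html) {
  var container = doc.getElementById(id);
  if (!container) {
    throw new Error('No element with id "' + id + '"');
  }
  container.innerHTML = html; // the browser parses the string into child nodes
  return container;
}

// In a browser page that contains <div id="sandbox"></div>:
//   var el = loadHtmlIntoElement(document, 'sandbox', $mywebpage);
//   var links = el.getElementsByTagName('a');
```

Assigning a complete page to the innerHTML of a div drops the outer html, head and body tags, so this suits fragments better than whole documents. In modern browsers, DOMParser's parseFromString with the 'text/html' type is an alternative for full pages.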